* [PATCH 00/34] x86: Memory Protection Keys (v5)
From: Dave Hansen @ 2015-12-04  1:14 UTC
  To: linux-kernel
  Cc: linux-mm, x86, Dave Hansen, linux-api, linux-arch, aarcange, akpm,
	jack, kirill.shutemov, n-horiguchi

Memory Protection Keys for User pages is a CPU feature that will
first appear on Skylake servers, but will also be supported on
future non-server parts.  It provides a mechanism for enforcing
page-based protections without requiring modification of the
page tables when an application changes protection domains.  See
the Documentation/ patch for more details.
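
To illustrate the model: each of the 16 protection keys maps to two
bits in the new thread-local PKRU register (access-disable and
write-disable), and a protection-domain change is just a PKRU write,
with no mprotect() call or page table update.  Below is a minimal
illustrative sketch using the documented instruction encodings; the
helpers are stand-ins, not the actual rdpkru/wrpkru hunks from this
series:

	/* PKRU bit 2k is access-disable (AD) and bit 2k+1 is
	 * write-disable (WD) for protection key k. */
	static inline unsigned int rdpkru(void)
	{
		unsigned int pkru, edx;

		/* RDPKRU: needs ecx=0, returns PKRU in eax */
		asm volatile(".byte 0x0f,0x01,0xee"
			     : "=a" (pkru), "=d" (edx) : "c" (0));
		return pkru;
	}

	static inline void wrpkru(unsigned int pkru)
	{
		/* WRPKRU: needs ecx=0 and edx=0 */
		asm volatile(".byte 0x0f,0x01,0xef"
			     : : "a" (pkru), "c" (0), "d" (0));
	}

	/* Revoke all access to pages tagged with 'pkey' -- note
	 * that no PTEs change and no TLB shootdown is needed. */
	static inline void pkey_deny_access(int pkey)
	{
		wrpkru(rdpkru() | (1u << (2 * pkey)));
	}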

Changes from v4:

 * Made "allow setting of XSAVE state" safe if we got preempted
   between when we saved our FPU state and when we restore it.
   (I would appreciate a look from Ingo on this patch).
 * Fixed up a few things from Thomas's latest comments: splt up
   siginfo in to x86 and generic, removed extra 'eax' variable
   in rdpkru function, reworked vm_flags assignment, reworded
   a comment in pte_allows_gup()
 * Add missing DISABLED/REQUIRED_MASK14 in cpufeature.h
 * Added comment about compile optimization in fault path
 * Left get_user_pages_locked() alone.  Andrea thinks we need it.

Changes from RFCv3:

 * Added 'current' and 'foreign' variants of get_user_pages() to
   help indicate whether protection keys should be enforced.
   Thanks to Jerome Glisse for pointing out this issue.
 * Added "allocation" and set/get system calls so that we can
   manage protection keys in the kernel (see the sketch after
   this list).  This opens the door to dedicating specific
   protection keys to kernel use in the future, such as for
   execute-only memory.
 * Removed the kselftest code for the moment.  It will be
   submitted separately.
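
For reference, the intended userspace flow with these management
calls looks roughly like the sketch below.  The wrapper names and the
PKEY_DENY_WRITE constant are hypothetical stand-ins named after the
patch titles; the real prototypes are in the corresponding patches:

	/* Hypothetical wrappers around the new syscalls: */
	void pkey_example(void *addr, size_t len)
	{
		int pkey = pkey_alloc();	/* allocation/free syscalls */

		mprotect_key(addr, len, PROT_READ | PROT_WRITE, pkey);
		pkey_set(pkey, PKEY_DENY_WRITE);	/* set/get syscalls */
		/* writes to [addr, addr+len) now fault with SEGV_PKUERR */
		pkey_free(pkey);
	}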

Changes from RFCv2 (Thanks Ingo and Thomas for most of these):

 * fixed a few minor compile warnings
 * changed 'nopku' interaction with cpuid bits.  Now, we do not
   clear the PKU cpuid bit; we just skip enabling it.
 * changed __pkru_allows_write() to also check the access-disable
   bit (see the sketch after this list)
 * removed the unused write_pkru()
 * made si_pkey a u64 and added some patch description details.
   Also made it share space in siginfo with MPX and clarified
   comments.
 * gave some real text for the Processor Trace xsave state
 * made vma_pkey() less ugly (and actually much more optimized)
 * added SEGV_PKUERR to copy_siginfo_to_user()
 * removed the page table walk when filling in si_pkey and added
   some big fat comments about it being inherently racy
 * added self test code
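
The write check mentioned in the __pkru_allows_write() bullet above
boils down to a two-bit test; a sketch of the logic (not the exact
patch hunk):

	/* A write is allowed only if neither the write-disable (WD)
	 * nor the access-disable (AD) bit is set for this key:
	 * AD by itself already forbids writes. */
	static inline int pkru_allows_write(unsigned int pkru, int pkey)
	{
		unsigned int ad = 1u << (2 * pkey);
		unsigned int wd = 1u << (2 * pkey + 1);

		return !(pkru & (ad | wd));
	}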

This code is not runnable by anyone outside of Intel unless they
have some special hardware or a fancy simulator.  If you are
interested in running this for real, please get in touch with me.
Hardware is available to a very small but nonzero number of
people.

This set is also available here (with the new syscall):

	git://git.kernel.org/pub/scm/linux/kernel/git/daveh/x86-pkeys.git pkeys-v014

=== diffstat ===

Dave Hansen (34):
      mm, gup: introduce concept of "foreign" get_user_pages()
      x86, fpu: add placeholder for Processor Trace XSAVE state
      x86, pkeys: Add Kconfig option
      x86, pkeys: cpuid bit definition
      x86, pkeys: define new CR4 bit
      x86, pkeys: add PKRU xsave fields and data structure(s)
      x86, pkeys: PTE bits for storing protection key
      x86, pkeys: new page fault error code bit: PF_PK
      x86, pkeys: store protection in high VMA flags
      x86, pkeys: arch-specific protection bits
      x86, pkeys: pass VMA down in to fault signal generation code
      signals, pkeys: notify userspace about protection key faults
      x86, pkeys: fill in pkey field in siginfo
      x86, pkeys: add functions to fetch PKRU
      mm: factor out VMA fault permission checking
      x86, mm: simplify get_user_pages() PTE bit handling
      x86, pkeys: check VMAs and PTEs for protection keys
      mm: add gup flag to indicate "foreign" mm access
      x86, pkeys: optimize fault handling in access_error()
      x86, pkeys: differentiate instruction fetches
      x86, pkeys: dump PKRU with other kernel registers
      x86, pkeys: dump PTE pkey in /proc/pid/smaps
      x86, pkeys: add Kconfig prompt to existing config option
      mm, multi-arch: pass a protection key in to calc_vm_flag_bits()
      x86, pkeys: add arch_validate_pkey()
      mm: implement new mprotect_key() system call
      x86, pkeys: make mprotect_key() mask off additional vm_flags
      x86: wire up mprotect_key() system call
      x86: separate out LDT init from context init
      x86, fpu: allow setting of XSAVE state
      x86, pkeys: allocation/free syscalls
      x86, pkeys: add pkey set/get syscalls
      x86, pkeys: actually enable Memory Protection Keys in CPU
      x86, pkeys: Documentation

 Documentation/kernel-parameters.txt         |   3 +
 Documentation/x86/protection-keys.txt       |  53 +++++
 arch/mips/mm/gup.c                          |   3 +-
 arch/powerpc/include/asm/mman.h             |   5 +-
 arch/powerpc/include/asm/mmu_context.h      |  12 +
 arch/s390/include/asm/mmu_context.h         |  12 +
 arch/s390/mm/gup.c                          |   3 +-
 arch/sh/mm/gup.c                            |   2 +-
 arch/sparc/mm/gup.c                         |   2 +-
 arch/unicore32/include/asm/mmu_context.h    |  12 +
 arch/x86/Kconfig                            |  16 ++
 arch/x86/entry/syscalls/syscall_32.tbl      |   5 +
 arch/x86/entry/syscalls/syscall_64.tbl      |   5 +
 arch/x86/include/asm/cpufeature.h           |  56 +++--
 arch/x86/include/asm/disabled-features.h    |  13 ++
 arch/x86/include/asm/fpu/internal.h         |   2 +
 arch/x86/include/asm/fpu/types.h            |  12 +
 arch/x86/include/asm/fpu/xstate.h           |   4 +-
 arch/x86/include/asm/mmu.h                  |   7 +
 arch/x86/include/asm/mmu_context.h          | 110 ++++++++-
 arch/x86/include/asm/pgtable.h              |  38 +++
 arch/x86/include/asm/pgtable_types.h        |  34 ++-
 arch/x86/include/asm/pkeys.h                |  67 ++++++
 arch/x86/include/asm/required-features.h    |   5 +
 arch/x86/include/asm/special_insns.h        |  22 ++
 arch/x86/include/uapi/asm/mman.h            |  22 ++
 arch/x86/include/uapi/asm/processor-flags.h |   2 +
 arch/x86/kernel/cpu/common.c                |  42 ++++
 arch/x86/kernel/fpu/core.c                  |  63 +++++
 arch/x86/kernel/fpu/xstate.c                | 241 +++++++++++++++++++-
 arch/x86/kernel/ldt.c                       |   4 +-
 arch/x86/kernel/process_64.c                |   2 +
 arch/x86/kernel/setup.c                     |   9 +
 arch/x86/mm/fault.c                         | 158 +++++++++++--
 arch/x86/mm/gup.c                           |  51 +++--
 arch/x86/mm/mpx.c                           |   4 +-
 drivers/char/agp/frontend.c                 |   2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c     |   4 +-
 drivers/gpu/drm/i915/i915_gem_userptr.c     |   2 +-
 drivers/gpu/drm/radeon/radeon_ttm.c         |   4 +-
 drivers/gpu/drm/via/via_dmablit.c           |   3 +-
 drivers/infiniband/core/umem.c              |   2 +-
 drivers/infiniband/core/umem_odp.c          |   8 +-
 drivers/infiniband/hw/mthca/mthca_memfree.c |   3 +-
 drivers/infiniband/hw/qib/qib_user_pages.c  |   3 +-
 drivers/infiniband/hw/usnic/usnic_uiom.c    |   2 +-
 drivers/iommu/amd_iommu_v2.c                |   8 +-
 drivers/media/pci/ivtv/ivtv-udma.c          |   4 +-
 drivers/media/pci/ivtv/ivtv-yuv.c           |  10 +-
 drivers/media/v4l2-core/videobuf-dma-sg.c   |   3 +-
 drivers/misc/sgi-gru/grufault.c             |   3 +-
 drivers/scsi/st.c                           |   2 -
 drivers/staging/android/ashmem.c            |   4 +-
 drivers/video/fbdev/pvr2fb.c                |   4 +-
 drivers/virt/fsl_hypervisor.c               |   5 +-
 fs/exec.c                                   |   8 +-
 fs/proc/task_mmu.c                          |   5 +
 include/asm-generic/mm_hooks.h              |  12 +
 include/linux/mm.h                          |  55 ++++-
 include/linux/mman.h                        |   6 +-
 include/linux/pkeys.h                       |  59 +++++
 include/uapi/asm-generic/mman-common.h      |   5 +
 include/uapi/asm-generic/siginfo.h          |  17 +-
 kernel/events/uprobes.c                     |   4 +-
 kernel/signal.c                             |   4 +
 mm/Kconfig                                  |  13 ++
 mm/frame_vector.c                           |   2 +-
 mm/gup.c                                    |  93 ++++++--
 mm/ksm.c                                    |  10 +-
 mm/memory.c                                 |   8 +-
 mm/mempolicy.c                              |   6 +-
 mm/mmap.c                                   |   2 +-
 mm/mprotect.c                               | 136 ++++++++++-
 mm/nommu.c                                  |  35 ++-
 mm/process_vm_access.c                      |   6 +-
 mm/util.c                                   |   4 +-
 net/ceph/pagevec.c                          |   2 +-
 security/tomoyo/domain.c                    |   9 +-
 virt/kvm/async_pf.c                         |   2 +-
 virt/kvm/kvm_main.c                         |  13 +-
 80 files changed, 1470 insertions(+), 223 deletions(-)

Cc: linux-api@vger.kernel.org
Cc: linux-arch@vger.kernel.org
Cc: aarcange@redhat.com
Cc: akpm@linux-foundation.org
Cc: jack@suse.cz
Cc: kirill.shutemov@linux.intel.com
Cc: n-horiguchi@ah.jp.nec.com
Cc: x86@kernel.org

* [PATCH 01/34] mm, gup: introduce concept of "foreign" get_user_pages()
From: Dave Hansen @ 2015-12-04  1:14 UTC
  To: linux-kernel
  Cc: linux-mm, x86, Dave Hansen, dave.hansen, akpm, kirill.shutemov,
	aarcange, n-horiguchi


From: Dave Hansen <dave.hansen@linux.intel.com>

For protection keys, we need to understand whether protections
should be enforced in software or not.  In general, we enforce
protections when working on our own task, but not when working
on someone else's.  We call these "current" and "foreign"
operations.

This introduces two new get_user_pages() variants:

	get_current_user_pages()
	get_foreign_user_pages()

get_current_user_pages() is a drop-in replacement for when
get_user_pages() was called with (current, current->mm, ...) as
arguments.  Using it makes a few of the call sites look a bit
nicer.

get_foreign_user_pages() is a replacement for when
get_user_pages() is called on a non-current tsk/mm.

We leave a stub get_user_pages() around with a __deprecated
warning.
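
For example (mirroring the arch/x86/mm/mpx.c hunk below), a call
site that operates on the current task converts like this:

	/* before: */
	ret = get_user_pages(current, current->mm, addr, nr_pages,
			     write, force, pages, NULL);
	/* after: */
	ret = get_current_user_pages(addr, nr_pages, write, force,
				     pages, NULL);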

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
---

 b/arch/mips/mm/gup.c                          |    3 -
 b/arch/s390/mm/gup.c                          |    3 -
 b/arch/sh/mm/gup.c                            |    2 -
 b/arch/sparc/mm/gup.c                         |    2 -
 b/arch/x86/mm/gup.c                           |    2 -
 b/arch/x86/mm/mpx.c                           |    4 +-
 b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c     |    4 +-
 b/drivers/gpu/drm/i915/i915_gem_userptr.c     |    2 -
 b/drivers/gpu/drm/radeon/radeon_ttm.c         |    4 +-
 b/drivers/gpu/drm/via/via_dmablit.c           |    3 -
 b/drivers/infiniband/core/umem.c              |    2 -
 b/drivers/infiniband/core/umem_odp.c          |    8 ++--
 b/drivers/infiniband/hw/mthca/mthca_memfree.c |    3 -
 b/drivers/infiniband/hw/qib/qib_user_pages.c  |    3 -
 b/drivers/infiniband/hw/usnic/usnic_uiom.c    |    2 -
 b/drivers/media/pci/ivtv/ivtv-udma.c          |    4 +-
 b/drivers/media/pci/ivtv/ivtv-yuv.c           |   10 ++---
 b/drivers/media/v4l2-core/videobuf-dma-sg.c   |    3 -
 b/drivers/misc/sgi-gru/grufault.c             |    3 -
 b/drivers/scsi/st.c                           |    2 -
 b/drivers/video/fbdev/pvr2fb.c                |    4 +-
 b/drivers/virt/fsl_hypervisor.c               |    5 +-
 b/fs/exec.c                                   |    8 +++-
 b/include/linux/mm.h                          |   39 +++++++++++++------
 b/kernel/events/uprobes.c                     |    4 +-
 b/mm/frame_vector.c                           |    2 -
 b/mm/gup.c                                    |   51 ++++++++++++++++----------
 b/mm/memory.c                                 |    2 -
 b/mm/mempolicy.c                              |    6 +--
 b/mm/nommu.c                                  |   34 ++++++++++-------
 b/mm/process_vm_access.c                      |    6 ++-
 b/mm/util.c                                   |    4 --
 b/net/ceph/pagevec.c                          |    2 -
 b/security/tomoyo/domain.c                    |    9 ++++
 b/virt/kvm/async_pf.c                         |    2 -
 b/virt/kvm/kvm_main.c                         |   13 +++---
 36 files changed, 147 insertions(+), 113 deletions(-)

diff -puN arch/mips/mm/gup.c~get_current_user_pages arch/mips/mm/gup.c
--- a/arch/mips/mm/gup.c~get_current_user_pages	2015-12-03 16:21:17.700311841 -0800
+++ b/arch/mips/mm/gup.c	2015-12-03 16:21:17.762314653 -0800
@@ -301,8 +301,7 @@ slow_irqon:
 	start += nr << PAGE_SHIFT;
 	pages += nr;
 
-	ret = get_user_pages_unlocked(current, mm, start,
-				      (end - start) >> PAGE_SHIFT,
+	ret = get_user_pages_unlocked(start, (end - start) >> PAGE_SHIFT,
 				      write, 0, pages);
 
 	/* Have to be a bit careful with return values */
diff -puN arch/s390/mm/gup.c~get_current_user_pages arch/s390/mm/gup.c
--- a/arch/s390/mm/gup.c~get_current_user_pages	2015-12-03 16:21:17.701311886 -0800
+++ b/arch/s390/mm/gup.c	2015-12-03 16:21:17.762314653 -0800
@@ -241,8 +241,7 @@ int get_user_pages_fast(unsigned long st
 	/* Try to get the remaining pages with get_user_pages */
 	start += nr << PAGE_SHIFT;
 	pages += nr;
-	ret = get_user_pages_unlocked(current, mm, start,
-			     nr_pages - nr, write, 0, pages);
+	ret = get_user_pages_unlocked(start, nr_pages - nr, write, 0, pages);
 	/* Have to be a bit careful with return values */
 	if (nr > 0)
 		ret = (ret < 0) ? nr : ret + nr;
diff -puN arch/sh/mm/gup.c~get_current_user_pages arch/sh/mm/gup.c
--- a/arch/sh/mm/gup.c~get_current_user_pages	2015-12-03 16:21:17.703311977 -0800
+++ b/arch/sh/mm/gup.c	2015-12-03 16:21:17.762314653 -0800
@@ -257,7 +257,7 @@ slow_irqon:
 		start += nr << PAGE_SHIFT;
 		pages += nr;
 
-		ret = get_user_pages_unlocked(current, mm, start,
+		ret = get_user_pages_unlocked(start,
 			(end - start) >> PAGE_SHIFT, write, 0, pages);
 
 		/* Have to be a bit careful with return values */
diff -puN arch/sparc/mm/gup.c~get_current_user_pages arch/sparc/mm/gup.c
--- a/arch/sparc/mm/gup.c~get_current_user_pages	2015-12-03 16:21:17.704312023 -0800
+++ b/arch/sparc/mm/gup.c	2015-12-03 16:21:17.763314698 -0800
@@ -249,7 +249,7 @@ slow:
 		start += nr << PAGE_SHIFT;
 		pages += nr;
 
-		ret = get_user_pages_unlocked(current, mm, start,
+		ret = get_user_pages_unlocked(start,
 			(end - start) >> PAGE_SHIFT, write, 0, pages);
 
 		/* Have to be a bit careful with return values */
diff -puN arch/x86/mm/gup.c~get_current_user_pages arch/x86/mm/gup.c
--- a/arch/x86/mm/gup.c~get_current_user_pages	2015-12-03 16:21:17.706312113 -0800
+++ b/arch/x86/mm/gup.c	2015-12-03 16:21:17.763314698 -0800
@@ -386,7 +386,7 @@ slow_irqon:
 		start += nr << PAGE_SHIFT;
 		pages += nr;
 
-		ret = get_user_pages_unlocked(current, mm, start,
+		ret = get_user_pages_unlocked(start,
 					      (end - start) >> PAGE_SHIFT,
 					      write, 0, pages);
 
diff -puN arch/x86/mm/mpx.c~get_current_user_pages arch/x86/mm/mpx.c
--- a/arch/x86/mm/mpx.c~get_current_user_pages	2015-12-03 16:21:17.708312204 -0800
+++ b/arch/x86/mm/mpx.c	2015-12-03 16:21:17.763314698 -0800
@@ -546,8 +546,8 @@ static int mpx_resolve_fault(long __user
 	int nr_pages = 1;
 	int force = 0;
 
-	gup_ret = get_user_pages(current, current->mm, (unsigned long)addr,
-				 nr_pages, write, force, NULL, NULL);
+	gup_ret = get_current_user_pages((unsigned long)addr, nr_pages, write,
+			force, NULL, NULL);
 	/*
 	 * get_user_pages() returns number of pages gotten.
 	 * 0 means we failed to fault in and get anything,
diff -puN drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c~get_current_user_pages drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c~get_current_user_pages	2015-12-03 16:21:17.709312249 -0800
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c	2015-12-03 16:21:17.764314744 -0800
@@ -518,8 +518,8 @@ static int amdgpu_ttm_tt_pin_userptr(str
 		uint64_t userptr = gtt->userptr + pinned * PAGE_SIZE;
 		struct page **pages = ttm->pages + pinned;
 
-		r = get_user_pages(current, current->mm, userptr, num_pages,
-				   write, 0, pages, NULL);
+		r = get_current_user_pages(userptr, num_pages, write, 0, pages,
+				NULL);
 		if (r < 0)
 			goto release_pages;
 
diff -puN drivers/gpu/drm/i915/i915_gem_userptr.c~get_current_user_pages drivers/gpu/drm/i915/i915_gem_userptr.c
--- a/drivers/gpu/drm/i915/i915_gem_userptr.c~get_current_user_pages	2015-12-03 16:21:17.711312340 -0800
+++ b/drivers/gpu/drm/i915/i915_gem_userptr.c	2015-12-03 16:21:17.764314744 -0800
@@ -587,7 +587,7 @@ __i915_gem_userptr_get_pages_worker(stru
 
 		down_read(&mm->mmap_sem);
 		while (pinned < num_pages) {
-			ret = get_user_pages(work->task, mm,
+			ret = get_foreign_user_pages(work->task, mm,
 					     obj->userptr.ptr + pinned * PAGE_SIZE,
 					     num_pages - pinned,
 					     !obj->userptr.read_only, 0,
diff -puN drivers/gpu/drm/radeon/radeon_ttm.c~get_current_user_pages drivers/gpu/drm/radeon/radeon_ttm.c
--- a/drivers/gpu/drm/radeon/radeon_ttm.c~get_current_user_pages	2015-12-03 16:21:17.713312431 -0800
+++ b/drivers/gpu/drm/radeon/radeon_ttm.c	2015-12-03 16:21:17.765314789 -0800
@@ -554,8 +554,8 @@ static int radeon_ttm_tt_pin_userptr(str
 		uint64_t userptr = gtt->userptr + pinned * PAGE_SIZE;
 		struct page **pages = ttm->pages + pinned;
 
-		r = get_user_pages(current, current->mm, userptr, num_pages,
-				   write, 0, pages, NULL);
+		r = get_current_user_pages(userptr, num_pages, write, 0, pages,
+				NULL);
 		if (r < 0)
 			goto release_pages;
 
diff -puN drivers/gpu/drm/via/via_dmablit.c~get_current_user_pages drivers/gpu/drm/via/via_dmablit.c
--- a/drivers/gpu/drm/via/via_dmablit.c~get_current_user_pages	2015-12-03 16:21:17.714312476 -0800
+++ b/drivers/gpu/drm/via/via_dmablit.c	2015-12-03 16:21:17.765314789 -0800
@@ -239,8 +239,7 @@ via_lock_all_dma_pages(drm_via_sg_info_t
 	if (NULL == vsg->pages)
 		return -ENOMEM;
 	down_read(&current->mm->mmap_sem);
-	ret = get_user_pages(current, current->mm,
-			     (unsigned long)xfer->mem_addr,
+	ret = get_current_user_pages((unsigned long)xfer->mem_addr,
 			     vsg->num_pages,
 			     (vsg->direction == DMA_FROM_DEVICE),
 			     0, vsg->pages, NULL);
diff -puN drivers/infiniband/core/umem.c~get_current_user_pages drivers/infiniband/core/umem.c
--- a/drivers/infiniband/core/umem.c~get_current_user_pages	2015-12-03 16:21:17.716312567 -0800
+++ b/drivers/infiniband/core/umem.c	2015-12-03 16:21:17.766314834 -0800
@@ -188,7 +188,7 @@ struct ib_umem *ib_umem_get(struct ib_uc
 	sg_list_start = umem->sg_head.sgl;
 
 	while (npages) {
-		ret = get_user_pages(current, current->mm, cur_base,
+		ret = get_current_user_pages(cur_base,
 				     min_t(unsigned long, npages,
 					   PAGE_SIZE / sizeof (struct page *)),
 				     1, !umem->writable, page_list, vma_list);
diff -puN drivers/infiniband/core/umem_odp.c~get_current_user_pages drivers/infiniband/core/umem_odp.c
--- a/drivers/infiniband/core/umem_odp.c~get_current_user_pages	2015-12-03 16:21:17.718312657 -0800
+++ b/drivers/infiniband/core/umem_odp.c	2015-12-03 16:21:17.766314834 -0800
@@ -572,10 +572,10 @@ int ib_umem_odp_map_dma_pages(struct ib_
 		 * complex (and doesn't gain us much performance in most use
 		 * cases).
 		 */
-		npages = get_user_pages(owning_process, owning_mm, user_virt,
-					gup_num_pages,
-					access_mask & ODP_WRITE_ALLOWED_BIT, 0,
-					local_page_list, NULL);
+		npages = get_foreign_user_pages(owning_process, owning_mm,
+				user_virt, gup_num_pages,
+				access_mask & ODP_WRITE_ALLOWED_BIT,
+				0, local_page_list, NULL);
 		up_read(&owning_mm->mmap_sem);
 
 		if (npages < 0)
diff -puN drivers/infiniband/hw/mthca/mthca_memfree.c~get_current_user_pages drivers/infiniband/hw/mthca/mthca_memfree.c
--- a/drivers/infiniband/hw/mthca/mthca_memfree.c~get_current_user_pages	2015-12-03 16:21:17.719312703 -0800
+++ b/drivers/infiniband/hw/mthca/mthca_memfree.c	2015-12-03 16:21:17.767314880 -0800
@@ -472,8 +472,7 @@ int mthca_map_user_db(struct mthca_dev *
 		goto out;
 	}
 
-	ret = get_user_pages(current, current->mm, uaddr & PAGE_MASK, 1, 1, 0,
-			     pages, NULL);
+	ret = get_current_user_pages(uaddr & PAGE_MASK, 1, 1, 0, pages, NULL);
 	if (ret < 0)
 		goto out;
 
diff -puN drivers/infiniband/hw/qib/qib_user_pages.c~get_current_user_pages drivers/infiniband/hw/qib/qib_user_pages.c
--- a/drivers/infiniband/hw/qib/qib_user_pages.c~get_current_user_pages	2015-12-03 16:21:17.721312794 -0800
+++ b/drivers/infiniband/hw/qib/qib_user_pages.c	2015-12-03 16:21:17.767314880 -0800
@@ -66,8 +66,7 @@ static int __qib_get_user_pages(unsigned
 	}
 
 	for (got = 0; got < num_pages; got += ret) {
-		ret = get_user_pages(current, current->mm,
-				     start_page + got * PAGE_SIZE,
+		ret = get_current_user_pages(start_page + got * PAGE_SIZE,
 				     num_pages - got, 1, 1,
 				     p + got, NULL);
 		if (ret < 0)
diff -puN drivers/infiniband/hw/usnic/usnic_uiom.c~get_current_user_pages drivers/infiniband/hw/usnic/usnic_uiom.c
--- a/drivers/infiniband/hw/usnic/usnic_uiom.c~get_current_user_pages	2015-12-03 16:21:17.723312884 -0800
+++ b/drivers/infiniband/hw/usnic/usnic_uiom.c	2015-12-03 16:21:17.767314880 -0800
@@ -144,7 +144,7 @@ static int usnic_uiom_get_pages(unsigned
 	ret = 0;
 
 	while (npages) {
-		ret = get_user_pages(current, current->mm, cur_base,
+		ret = get_current_user_pages(cur_base,
 					min_t(unsigned long, npages,
 					PAGE_SIZE / sizeof(struct page *)),
 					1, !writable, page_list, NULL);
diff -puN drivers/media/pci/ivtv/ivtv-udma.c~get_current_user_pages drivers/media/pci/ivtv/ivtv-udma.c
--- a/drivers/media/pci/ivtv/ivtv-udma.c~get_current_user_pages	2015-12-03 16:21:17.724312930 -0800
+++ b/drivers/media/pci/ivtv/ivtv-udma.c	2015-12-03 16:21:17.768314925 -0800
@@ -124,8 +124,8 @@ int ivtv_udma_setup(struct ivtv *itv, un
 	}
 
 	/* Get user pages for DMA Xfer */
-	err = get_user_pages_unlocked(current, current->mm,
-			user_dma.uaddr, user_dma.page_count, 0, 1, dma->map);
+	err = get_user_pages_unlocked(user_dma.uaddr, user_dma.page_count, 0,
+			1, dma->map);
 
 	if (user_dma.page_count != err) {
 		IVTV_DEBUG_WARN("failed to map user pages, returned %d instead of %d\n",
diff -puN drivers/media/pci/ivtv/ivtv-yuv.c~get_current_user_pages drivers/media/pci/ivtv/ivtv-yuv.c
--- a/drivers/media/pci/ivtv/ivtv-yuv.c~get_current_user_pages	2015-12-03 16:21:17.726313020 -0800
+++ b/drivers/media/pci/ivtv/ivtv-yuv.c	2015-12-03 16:21:17.768314925 -0800
@@ -75,14 +75,12 @@ static int ivtv_yuv_prep_user_dma(struct
 	ivtv_udma_get_page_info (&uv_dma, (unsigned long)args->uv_source, 360 * uv_decode_height);
 
 	/* Get user pages for DMA Xfer */
-	y_pages = get_user_pages_unlocked(current, current->mm,
-				y_dma.uaddr, y_dma.page_count, 0, 1,
-				&dma->map[0]);
+	y_pages = get_user_pages_unlocked(y_dma.uaddr,
+			y_dma.page_count, 0, 1, &dma->map[0]);
 	uv_pages = 0; /* silence gcc. value is set and consumed only if: */
 	if (y_pages == y_dma.page_count) {
-		uv_pages = get_user_pages_unlocked(current, current->mm,
-					uv_dma.uaddr, uv_dma.page_count, 0, 1,
-					&dma->map[y_pages]);
+		uv_pages = get_user_pages_unlocked(uv_dma.uaddr,
+				uv_dma.page_count, 0, 1, &dma->map[y_pages]);
 	}
 
 	if (y_pages != y_dma.page_count || uv_pages != uv_dma.page_count) {
diff -puN drivers/media/v4l2-core/videobuf-dma-sg.c~get_current_user_pages drivers/media/v4l2-core/videobuf-dma-sg.c
--- a/drivers/media/v4l2-core/videobuf-dma-sg.c~get_current_user_pages	2015-12-03 16:21:17.728313111 -0800
+++ b/drivers/media/v4l2-core/videobuf-dma-sg.c	2015-12-03 16:21:17.769314970 -0800
@@ -181,8 +181,7 @@ static int videobuf_dma_init_user_locked
 	dprintk(1, "init user [0x%lx+0x%lx => %d pages]\n",
 		data, size, dma->nr_pages);
 
-	err = get_user_pages(current, current->mm,
-			     data & PAGE_MASK, dma->nr_pages,
+	err = get_current_user_pages(data & PAGE_MASK, dma->nr_pages,
 			     rw == READ, 1, /* force */
 			     dma->pages, NULL);
 
diff -puN drivers/misc/sgi-gru/grufault.c~get_current_user_pages drivers/misc/sgi-gru/grufault.c
--- a/drivers/misc/sgi-gru/grufault.c~get_current_user_pages	2015-12-03 16:21:17.729313156 -0800
+++ b/drivers/misc/sgi-gru/grufault.c	2015-12-03 16:21:17.769314970 -0800
@@ -198,8 +198,7 @@ static int non_atomic_pte_lookup(struct
 #else
 	*pageshift = PAGE_SHIFT;
 #endif
-	if (get_user_pages
-	    (current, current->mm, vaddr, 1, write, 0, &page, NULL) <= 0)
+	if (get_current_user_pages(vaddr, 1, write, 0, &page, NULL) <= 0)
 		return -EFAULT;
 	*paddr = page_to_phys(page);
 	put_page(page);
diff -puN drivers/scsi/st.c~get_current_user_pages drivers/scsi/st.c
--- a/drivers/scsi/st.c~get_current_user_pages	2015-12-03 16:21:17.731313247 -0800
+++ b/drivers/scsi/st.c	2015-12-03 16:21:17.771315061 -0800
@@ -4786,8 +4786,6 @@ static int sgl_map_user_pages(struct st_
         /* Try to fault in all of the necessary pages */
         /* rw==READ means read from drive, write into memory area */
 	res = get_user_pages_unlocked(
-		current,
-		current->mm,
 		uaddr,
 		nr_pages,
 		rw == READ,
diff -puN drivers/video/fbdev/pvr2fb.c~get_current_user_pages drivers/video/fbdev/pvr2fb.c
--- a/drivers/video/fbdev/pvr2fb.c~get_current_user_pages	2015-12-03 16:21:17.733313338 -0800
+++ b/drivers/video/fbdev/pvr2fb.c	2015-12-03 16:21:17.771315061 -0800
@@ -686,8 +686,8 @@ static ssize_t pvr2fb_write(struct fb_in
 	if (!pages)
 		return -ENOMEM;
 
-	ret = get_user_pages_unlocked(current, current->mm, (unsigned long)buf,
-				      nr_pages, WRITE, 0, pages);
+	ret = get_user_pages_unlocked((unsigned long)buf, nr_pages, WRITE,
+			0, pages);
 
 	if (ret < nr_pages) {
 		nr_pages = ret;
diff -puN drivers/virt/fsl_hypervisor.c~get_current_user_pages drivers/virt/fsl_hypervisor.c
--- a/drivers/virt/fsl_hypervisor.c~get_current_user_pages	2015-12-03 16:21:17.734313383 -0800
+++ b/drivers/virt/fsl_hypervisor.c	2015-12-03 16:21:17.772315107 -0800
@@ -244,9 +244,8 @@ static long ioctl_memcpy(struct fsl_hv_i
 
 	/* Get the physical addresses of the source buffer */
 	down_read(&current->mm->mmap_sem);
-	num_pinned = get_user_pages(current, current->mm,
-		param.local_vaddr - lb_offset, num_pages,
-		(param.source == -1) ? READ : WRITE,
+	num_pinned = get_current_user_pages(param.local_vaddr - lb_offset,
+		num_pages, (param.source == -1) ? READ : WRITE,
 		0, pages, NULL);
 	up_read(&current->mm->mmap_sem);
 
diff -puN fs/exec.c~get_current_user_pages fs/exec.c
--- a/fs/exec.c~get_current_user_pages	2015-12-03 16:21:17.736313474 -0800
+++ b/fs/exec.c	2015-12-03 16:21:17.772315107 -0800
@@ -198,8 +198,12 @@ static struct page *get_arg_page(struct
 			return NULL;
 	}
 #endif
-	ret = get_user_pages(current, bprm->mm, pos,
-			1, write, 1, &page, NULL);
+	/*
+	 * We are doing an exec().  'current' is the process
+	 * doing the exec and bprm->mm is the new process's mm.
+	 */
+	ret = get_foreign_user_pages(current, bprm->mm, pos, 1, write,
+			1, &page, NULL);
 	if (ret <= 0)
 		return NULL;
 
diff -puN include/linux/mm.h~get_current_user_pages include/linux/mm.h
--- a/include/linux/mm.h~get_current_user_pages	2015-12-03 16:21:17.738313565 -0800
+++ b/include/linux/mm.h	2015-12-03 16:21:17.773315152 -0800
@@ -1191,24 +1191,39 @@ long __get_user_pages(struct task_struct
 		      unsigned long start, unsigned long nr_pages,
 		      unsigned int foll_flags, struct page **pages,
 		      struct vm_area_struct **vmas, int *nonblocking);
-long get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
-		    unsigned long start, unsigned long nr_pages,
-		    int write, int force, struct page **pages,
-		    struct vm_area_struct **vmas);
-long get_user_pages_locked(struct task_struct *tsk, struct mm_struct *mm,
-		    unsigned long start, unsigned long nr_pages,
-		    int write, int force, struct page **pages,
-		    int *locked);
-long __get_user_pages_unlocked(struct task_struct *tsk, struct mm_struct *mm,
-			       unsigned long start, unsigned long nr_pages,
+long get_foreign_user_pages(struct task_struct *tsk, struct mm_struct *mm,
+			    unsigned long start, unsigned long nr_pages,
+			    int write, int force, struct page **pages,
+			    struct vm_area_struct **vmas);
+long get_current_user_pages(unsigned long start, unsigned long nr_pages,
+			    int write, int force, struct page **pages,
+			    struct vm_area_struct **vmas);
+long get_user_pages_locked(unsigned long start, unsigned long nr_pages,
+		    int write, int force, struct page **pages, int *locked);
+long __get_user_pages_unlocked(unsigned long start, unsigned long nr_pages,
 			       int write, int force, struct page **pages,
 			       unsigned int gup_flags);
-long get_user_pages_unlocked(struct task_struct *tsk, struct mm_struct *mm,
-		    unsigned long start, unsigned long nr_pages,
+long get_user_pages_unlocked(unsigned long start, unsigned long nr_pages,
 		    int write, int force, struct page **pages);
 int get_user_pages_fast(unsigned long start, int nr_pages, int write,
 			struct page **pages);
 
+/*
+ * All callers should use get_foreign_user_pages() or
+ * get_current_user_pages().  The foreign variant is the most
+ * permissive and is the least likely to break something in
+ * a negative way.
+ */
+static inline __deprecated
+long get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
+		    unsigned long start, unsigned long nr_pages,
+		    int write, int force, struct page **pages,
+		    struct vm_area_struct **vmas)
+{
+	return get_foreign_user_pages(tsk, mm, start, nr_pages, write, force,
+				      pages, vmas);
+}
+
 /* Container for pinned pfns / pages */
 struct frame_vector {
 	unsigned int nr_allocated;	/* Number of frames we have space for */
diff -puN kernel/events/uprobes.c~get_current_user_pages kernel/events/uprobes.c
--- a/kernel/events/uprobes.c~get_current_user_pages	2015-12-03 16:21:17.739313610 -0800
+++ b/kernel/events/uprobes.c	2015-12-03 16:21:17.774315197 -0800
@@ -298,7 +298,7 @@ int uprobe_write_opcode(struct mm_struct
 
 retry:
 	/* Read the page with vaddr into memory */
-	ret = get_user_pages(NULL, mm, vaddr, 1, 0, 1, &old_page, &vma);
+	ret = get_foreign_user_pages(NULL, mm, vaddr, 1, 0, 1, &old_page, &vma);
 	if (ret <= 0)
 		return ret;
 
@@ -1699,7 +1699,7 @@ static int is_trap_at_addr(struct mm_str
 	if (likely(result == 0))
 		goto out;
 
-	result = get_user_pages(NULL, mm, vaddr, 1, 0, 1, &page, NULL);
+	result = get_current_user_pages(vaddr, 1, 0, 1, &page, NULL);
 	if (result < 0)
 		return result;
 
diff -puN mm/frame_vector.c~get_current_user_pages mm/frame_vector.c
--- a/mm/frame_vector.c~get_current_user_pages	2015-12-03 16:21:17.741313701 -0800
+++ b/mm/frame_vector.c	2015-12-03 16:21:17.774315197 -0800
@@ -58,7 +58,7 @@ int get_vaddr_frames(unsigned long start
 	if (!(vma->vm_flags & (VM_IO | VM_PFNMAP))) {
 		vec->got_ref = true;
 		vec->is_pfns = false;
-		ret = get_user_pages_locked(current, mm, start, nr_frames,
+		ret = get_user_pages_locked(start, nr_frames,
 			write, force, (struct page **)(vec->ptrs), &locked);
 		goto out;
 	}
diff -puN mm/gup.c~get_current_user_pages mm/gup.c
--- a/mm/gup.c~get_current_user_pages	2015-12-03 16:21:17.743313791 -0800
+++ b/mm/gup.c	2015-12-03 16:21:17.775315243 -0800
@@ -735,13 +735,13 @@ static __always_inline long __get_user_p
  *      if (locked)
  *          up_read(&mm->mmap_sem);
  */
-long get_user_pages_locked(struct task_struct *tsk, struct mm_struct *mm,
-			   unsigned long start, unsigned long nr_pages,
+long get_user_pages_locked(unsigned long start, unsigned long nr_pages,
 			   int write, int force, struct page **pages,
 			   int *locked)
 {
-	return __get_user_pages_locked(tsk, mm, start, nr_pages, write, force,
-				       pages, NULL, locked, true, FOLL_TOUCH);
+	return __get_user_pages_locked(current, current->mm, start, nr_pages,
+				       write, force, pages, NULL, locked, true,
+				       FOLL_TOUCH);
 }
 EXPORT_SYMBOL(get_user_pages_locked);
 
@@ -755,11 +755,12 @@ EXPORT_SYMBOL(get_user_pages_locked);
  * according to the parameters "pages", "write", "force"
  * respectively.
  */
-__always_inline long __get_user_pages_unlocked(struct task_struct *tsk, struct mm_struct *mm,
-					       unsigned long start, unsigned long nr_pages,
+__always_inline long __get_user_pages_unlocked(unsigned long start, unsigned long nr_pages,
 					       int write, int force, struct page **pages,
 					       unsigned int gup_flags)
 {
+	struct task_struct *tsk = current;
+	struct mm_struct *mm = tsk->mm;
 	long ret;
 	int locked = 1;
 	down_read(&mm->mmap_sem);
@@ -788,17 +789,16 @@ EXPORT_SYMBOL(__get_user_pages_unlocked)
  * or if "force" shall be set to 1 (get_user_pages_fast misses the
  * "force" parameter).
  */
-long get_user_pages_unlocked(struct task_struct *tsk, struct mm_struct *mm,
-			     unsigned long start, unsigned long nr_pages,
+long get_user_pages_unlocked(unsigned long start, unsigned long nr_pages,
 			     int write, int force, struct page **pages)
 {
-	return __get_user_pages_unlocked(tsk, mm, start, nr_pages, write,
+	return __get_user_pages_unlocked(start, nr_pages, write,
 					 force, pages, FOLL_TOUCH);
 }
 EXPORT_SYMBOL(get_user_pages_unlocked);
 
 /*
- * get_user_pages() - pin user pages in memory
+ * get_foreign_user_pages() - pin user pages in memory
  * @tsk:	the task_struct to use for page fault accounting, or
  *		NULL if faults are not to be recorded.
  * @mm:		mm_struct of target mm
@@ -852,14 +852,30 @@ EXPORT_SYMBOL(get_user_pages_unlocked);
  * should use get_user_pages because it cannot pass
  * FAULT_FLAG_ALLOW_RETRY to handle_mm_fault.
  */
-long get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
-		unsigned long start, unsigned long nr_pages, int write,
-		int force, struct page **pages, struct vm_area_struct **vmas)
+long get_foreign_user_pages(struct task_struct *tsk, struct mm_struct *mm,
+		unsigned long start, unsigned long nr_pages,
+		int write, int force, struct page **pages,
+		struct vm_area_struct **vmas)
 {
 	return __get_user_pages_locked(tsk, mm, start, nr_pages, write, force,
-				       pages, vmas, NULL, false, FOLL_TOUCH);
+					pages, vmas, NULL, false, FOLL_TOUCH);
 }
-EXPORT_SYMBOL(get_user_pages);
+EXPORT_SYMBOL(get_foreign_user_pages);
+
+/*
+ * This is exactly the same as get_foreign_user_pages(), just
+ * with a less-flexible calling convention where we assume that
+ * the task and mm being operated on are the current task's.
+ */
+long get_current_user_pages(unsigned long start, unsigned long nr_pages,
+		int write, int force, struct page **pages,
+		struct vm_area_struct **vmas)
+{
+	return get_foreign_user_pages(current, current->mm,
+				      start, nr_pages, write, force,
+				      pages, vmas);
+}
+EXPORT_SYMBOL(get_current_user_pages);
 
 /**
  * populate_vma_page_range() -  populate a range of pages in the vma.
@@ -1395,7 +1411,6 @@ int __get_user_pages_fast(unsigned long
 int get_user_pages_fast(unsigned long start, int nr_pages, int write,
 			struct page **pages)
 {
-	struct mm_struct *mm = current->mm;
 	int nr, ret;
 
 	start &= PAGE_MASK;
@@ -1407,8 +1422,8 @@ int get_user_pages_fast(unsigned long st
 		start += nr << PAGE_SHIFT;
 		pages += nr;
 
-		ret = get_user_pages_unlocked(current, mm, start,
-					      nr_pages - nr, write, 0, pages);
+		ret = get_user_pages_unlocked(start, nr_pages - nr, write, 0,
+					      pages);
 
 		/* Have to be a bit careful with return values */
 		if (nr > 0) {
diff -puN mm/memory.c~get_current_user_pages mm/memory.c
--- a/mm/memory.c~get_current_user_pages	2015-12-03 16:21:17.745313882 -0800
+++ b/mm/memory.c	2015-12-03 16:21:17.776315288 -0800
@@ -3659,7 +3659,7 @@ static int __access_remote_vm(struct tas
 		void *maddr;
 		struct page *page = NULL;
 
-		ret = get_user_pages(tsk, mm, addr, 1,
+		ret = get_foreign_user_pages(tsk, mm, addr, 1,
 				write, 1, &page, &vma);
 		if (ret <= 0) {
 #ifndef CONFIG_HAVE_IOREMAP_PROT
diff -puN mm/mempolicy.c~get_current_user_pages mm/mempolicy.c
--- a/mm/mempolicy.c~get_current_user_pages	2015-12-03 16:21:17.746313927 -0800
+++ b/mm/mempolicy.c	2015-12-03 16:21:17.777315333 -0800
@@ -813,12 +813,12 @@ static void get_policy_nodemask(struct m
 	}
 }
 
-static int lookup_node(struct mm_struct *mm, unsigned long addr)
+static int lookup_node(unsigned long addr)
 {
 	struct page *p;
 	int err;
 
-	err = get_user_pages(current, mm, addr & PAGE_MASK, 1, 0, 0, &p, NULL);
+	err = get_current_user_pages(addr & PAGE_MASK, 1, 0, 0, &p, NULL);
 	if (err >= 0) {
 		err = page_to_nid(p);
 		put_page(p);
@@ -873,7 +873,7 @@ static long do_get_mempolicy(int *policy
 
 	if (flags & MPOL_F_NODE) {
 		if (flags & MPOL_F_ADDR) {
-			err = lookup_node(mm, addr);
+			err = lookup_node(addr);
 			if (err < 0)
 				goto out;
 			*policy = err;
diff -puN mm/nommu.c~get_current_user_pages mm/nommu.c
--- a/mm/nommu.c~get_current_user_pages	2015-12-03 16:21:17.748314018 -0800
+++ b/mm/nommu.c	2015-12-03 16:21:17.778315379 -0800
@@ -182,7 +182,7 @@ finish_or_fault:
  *   slab page or a secondary page from a compound page
  * - don't permit access to VMAs that don't support it, such as I/O mappings
  */
-long get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
+long get_foreign_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 		    unsigned long start, unsigned long nr_pages,
 		    int write, int force, struct page **pages,
 		    struct vm_area_struct **vmas)
@@ -199,35 +199,41 @@ long get_user_pages(struct task_struct *
 }
 EXPORT_SYMBOL(get_user_pages);
 
-long get_user_pages_locked(struct task_struct *tsk, struct mm_struct *mm,
-			   unsigned long start, unsigned long nr_pages,
+long get_user_pages_locked(unsigned long start, unsigned long nr_pages,
 			   int write, int force, struct page **pages,
 			   int *locked)
 {
-	return get_user_pages(tsk, mm, start, nr_pages, write, force,
-			      pages, NULL);
+	return get_user_pages(current, current->mm, start, nr_pages, write,
+			      force, pages, NULL);
 }
 EXPORT_SYMBOL(get_user_pages_locked);
 
-long __get_user_pages_unlocked(struct task_struct *tsk, struct mm_struct *mm,
-			       unsigned long start, unsigned long nr_pages,
+long get_current_user_pages(unsigned long start, unsigned long nr_pages,
+		    int write, int force, struct page **pages,
+		    struct vm_area_struct **vmas)
+{
+	return get_foreign_user_pages(current, current->mm, start, nr_pages,
+				      write, force, pages, vmas);
+}
+EXPORT_SYMBOL(get_current_user_pages);
+
+long __get_user_pages_unlocked(unsigned long start, unsigned long nr_pages,
 			       int write, int force, struct page **pages,
 			       unsigned int gup_flags)
 {
 	long ret;
-	down_read(&mm->mmap_sem);
-	ret = get_user_pages(tsk, mm, start, nr_pages, write, force,
-			     pages, NULL);
-	up_read(&mm->mmap_sem);
+	down_read(&current->mm->mmap_sem);
+	ret = get_current_user_pages(start, nr_pages, write, force,
+				     pages, NULL);
+	up_read(&current->mm->mmap_sem);
 	return ret;
 }
 EXPORT_SYMBOL(__get_user_pages_unlocked);
 
-long get_user_pages_unlocked(struct task_struct *tsk, struct mm_struct *mm,
-			     unsigned long start, unsigned long nr_pages,
+long get_user_pages_unlocked(unsigned long start, unsigned long nr_pages,
 			     int write, int force, struct page **pages)
 {
-	return __get_user_pages_unlocked(tsk, mm, start, nr_pages, write,
+	return __get_user_pages_unlocked(start, nr_pages, write,
 					 force, pages, 0);
 }
 EXPORT_SYMBOL(get_user_pages_unlocked);
diff -puN mm/process_vm_access.c~get_current_user_pages mm/process_vm_access.c
--- a/mm/process_vm_access.c~get_current_user_pages	2015-12-03 16:21:17.750314109 -0800
+++ b/mm/process_vm_access.c	2015-12-03 16:21:17.778315379 -0800
@@ -99,8 +99,10 @@ static int process_vm_rw_single_vec(unsi
 		size_t bytes;
 
 		/* Get the pages we're interested in */
-		pages = get_user_pages_unlocked(task, mm, pa, pages,
-						vm_write, 0, process_pages);
+		down_read(&mm->mmap_sem);
+		pages = get_foreign_user_pages(task, mm, pa, pages, vm_write,
+						0, process_pages, NULL);
+		up_read(&mm->mmap_sem);
 		if (pages <= 0)
 			return -EFAULT;
 
diff -puN mm/util.c~get_current_user_pages mm/util.c
--- a/mm/util.c~get_current_user_pages	2015-12-03 16:21:17.751314154 -0800
+++ b/mm/util.c	2015-12-03 16:21:17.779315424 -0800
@@ -277,9 +277,7 @@ EXPORT_SYMBOL_GPL(__get_user_pages_fast)
 int __weak get_user_pages_fast(unsigned long start,
 				int nr_pages, int write, struct page **pages)
 {
-	struct mm_struct *mm = current->mm;
-	return get_user_pages_unlocked(current, mm, start, nr_pages,
-				       write, 0, pages);
+	return get_user_pages_unlocked(start, nr_pages, write, 0, pages);
 }
 EXPORT_SYMBOL_GPL(get_user_pages_fast);
 
diff -puN net/ceph/pagevec.c~get_current_user_pages net/ceph/pagevec.c
--- a/net/ceph/pagevec.c~get_current_user_pages	2015-12-03 16:21:17.753314245 -0800
+++ b/net/ceph/pagevec.c	2015-12-03 16:21:17.779315424 -0800
@@ -24,7 +24,7 @@ struct page **ceph_get_direct_page_vecto
 		return ERR_PTR(-ENOMEM);
 
 	while (got < num_pages) {
-		rc = get_user_pages_unlocked(current, current->mm,
+		rc = get_user_pages_unlocked(
 		    (unsigned long)data + ((unsigned long)got * PAGE_SIZE),
 		    num_pages - got, write_page, 0, pages + got);
 		if (rc < 0)
diff -puN security/tomoyo/domain.c~get_current_user_pages security/tomoyo/domain.c
--- a/security/tomoyo/domain.c~get_current_user_pages	2015-12-03 16:21:17.755314336 -0800
+++ b/security/tomoyo/domain.c	2015-12-03 16:21:17.779315424 -0800
@@ -874,7 +874,14 @@ bool tomoyo_dump_page(struct linux_binpr
 	}
 	/* Same with get_arg_page(bprm, pos, 0) in fs/exec.c */
 #ifdef CONFIG_MMU
-	if (get_user_pages(current, bprm->mm, pos, 1, 0, 1, &page, NULL) <= 0)
+	/*
+	 * This is called at execve() time in order to dig around
+	 * in the argv/environment of the new process
+	 * (represented by bprm).  'current' is the process doing
+	 * the execve().
+	 */
+	if (get_foreign_user_pages(current, bprm->mm, pos, 1,
+				0, 1, &page, NULL) <= 0)
 		return false;
 #else
 	page = bprm->page[pos / PAGE_SIZE];
diff -puN virt/kvm/async_pf.c~get_current_user_pages virt/kvm/async_pf.c
--- a/virt/kvm/async_pf.c~get_current_user_pages	2015-12-03 16:21:17.756314381 -0800
+++ b/virt/kvm/async_pf.c	2015-12-03 16:21:17.780315469 -0800
@@ -80,7 +80,7 @@ static void async_pf_execute(struct work
 
 	might_sleep();
 
-	get_user_pages_unlocked(NULL, mm, addr, 1, 1, 0, NULL);
+	get_user_pages_unlocked(addr, 1, 1, 0, NULL);
 	kvm_async_page_present_sync(vcpu, apf);
 
 	spin_lock(&vcpu->async_pf.lock);
diff -puN virt/kvm/kvm_main.c~get_current_user_pages virt/kvm/kvm_main.c
--- a/virt/kvm/kvm_main.c~get_current_user_pages	2015-12-03 16:21:17.758314472 -0800
+++ b/virt/kvm/kvm_main.c	2015-12-03 16:21:17.781315515 -0800
@@ -1274,15 +1274,16 @@ unsigned long kvm_vcpu_gfn_to_hva_prot(s
 	return gfn_to_hva_memslot_prot(slot, gfn, writable);
 }
 
-static int get_user_page_nowait(struct task_struct *tsk, struct mm_struct *mm,
-	unsigned long start, int write, struct page **page)
+static int get_user_page_nowait(unsigned long start, int write,
+		struct page **page)
 {
 	int flags = FOLL_TOUCH | FOLL_NOWAIT | FOLL_HWPOISON | FOLL_GET;
 
 	if (write)
 		flags |= FOLL_WRITE;
 
-	return __get_user_pages(tsk, mm, start, 1, flags, page, NULL, NULL);
+	return __get_user_pages(current, current->mm, start, 1, flags, page,
+			NULL, NULL);
 }
 
 static inline int check_user_page_hwpoison(unsigned long addr)
@@ -1344,12 +1345,10 @@ static int hva_to_pfn_slow(unsigned long
 
 	if (async) {
 		down_read(&current->mm->mmap_sem);
-		npages = get_user_page_nowait(current, current->mm,
-					      addr, write_fault, page);
+		npages = get_user_page_nowait(addr, write_fault, page);
 		up_read(&current->mm->mmap_sem);
 	} else
-		npages = __get_user_pages_unlocked(current, current->mm, addr, 1,
-						   write_fault, 0, page,
+		npages = __get_user_pages_unlocked(addr, 1, write_fault, 0, page,
 						   FOLL_TOUCH|FOLL_HWPOISON);
 	if (npages != 1)
 		return npages;
_
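
For reference, here is a minimal sketch of what converted call sites
look like under the new API.  The identifiers (tsk, mm, uaddr,
nr_pages, pages, ret) are placeholders, not code from this series.
Note that both variants are still called with mmap_sem held, just
like the old get_user_pages(); only the *_unlocked() helpers take
the lock themselves:

	/* Pinning pages in the current task's own address space: */
	down_read(&current->mm->mmap_sem);
	ret = get_current_user_pages(uaddr, nr_pages, 1 /* write */,
				     0 /* force */, pages, NULL);
	up_read(&current->mm->mmap_sem);

	/* Pinning pages in another process's mm, e.g. bprm->mm at exec: */
	down_read(&mm->mmap_sem);
	ret = get_foreign_user_pages(tsk, mm, uaddr, nr_pages,
				     1 /* write */, 0 /* force */,
				     pages, NULL);
	up_read(&mm->mmap_sem);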

^ permalink raw reply	[flat|nested] 145+ messages in thread

* [PATCH 02/34] x86, fpu: add placeholder for Processor Trace XSAVE state
  2015-12-04  1:14 ` Dave Hansen
@ 2015-12-04  1:14   ` Dave Hansen
  -1 siblings, 0 replies; 145+ messages in thread
From: Dave Hansen @ 2015-12-04  1:14 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-mm, x86, Dave Hansen, dave.hansen


From: Dave Hansen <dave.hansen@linux.intel.com>

x86 Maintainers,

I submitted this independently, but it must be applied before adding
subsequent patches.  Please drop this if it has already been applied.

---

From: Dave Hansen <dave.hansen@linux.intel.com>

There is an XSAVE state component for Intel Processor Trace.  But,
we do not use it and do not expect to ever use it.

We add a placeholder for it in the code so that it is not a mystery,
and so that we do not need an explicit enum initialization for
Protection Keys in a moment.

Why will we never use it?  According to Andi Kleen:

	The XSAVE support assumes that there is a single buffer
	for each thread. But perf generally doesn't work this
	way, it usually has only a single perf event per CPU per
	user, and when tracing multiple threads on that CPU it
	inherits perf event buffers between different threads. So
	XSAVE per thread cannot handle this inheritance case
	directly.

	Using multiple XSAVE areas (another one per perf event)
	would defeat some of the state caching that the CPUs do.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
---

 b/arch/x86/include/asm/fpu/types.h |    1 +
 b/arch/x86/kernel/fpu/xstate.c     |   10 ++++++++--
 2 files changed, 9 insertions(+), 2 deletions(-)

diff -puN arch/x86/include/asm/fpu/types.h~pt-xstate-bit arch/x86/include/asm/fpu/types.h
--- a/arch/x86/include/asm/fpu/types.h~pt-xstate-bit	2015-12-03 16:21:19.003370936 -0800
+++ b/arch/x86/include/asm/fpu/types.h	2015-12-03 16:21:19.008371163 -0800
@@ -108,6 +108,7 @@ enum xfeature {
 	XFEATURE_OPMASK,
 	XFEATURE_ZMM_Hi256,
 	XFEATURE_Hi16_ZMM,
+	XFEATURE_PT_UNIMPLEMENTED_SO_FAR,
 
 	XFEATURE_MAX,
 };
diff -puN arch/x86/kernel/fpu/xstate.c~pt-xstate-bit arch/x86/kernel/fpu/xstate.c
--- a/arch/x86/kernel/fpu/xstate.c~pt-xstate-bit	2015-12-03 16:21:19.004370981 -0800
+++ b/arch/x86/kernel/fpu/xstate.c	2015-12-03 16:21:19.008371163 -0800
@@ -13,6 +13,11 @@
 
 #include <asm/tlbflush.h>
 
+/*
+ * Although we spell it out in here, the Processor Trace
+ * xfeature is completely unused.  We use other mechanisms
+ * to save/restore PT state in Linux.
+ */
 static const char *xfeature_names[] =
 {
 	"x87 floating point registers"	,
@@ -23,7 +28,7 @@ static const char *xfeature_names[] =
 	"AVX-512 opmask"		,
 	"AVX-512 Hi256"			,
 	"AVX-512 ZMM_Hi256"		,
-	"unknown xstate feature"	,
+	"Processor Trace (unused)"	,
 };
 
 /*
@@ -469,7 +474,8 @@ static void check_xstate_against_struct(
 	 * numbers.
 	 */
 	if ((nr < XFEATURE_YMM) ||
-	    (nr >= XFEATURE_MAX)) {
+	    (nr >= XFEATURE_MAX) ||
+	    (nr == XFEATURE_PT_UNIMPLEMENTED_SO_FAR)) {
 		WARN_ONCE(1, "no structure for xstate: %d\n", nr);
 		XSTATE_WARN_ON(1);
 	}
_
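
For context: XSAVE state component numbers are fixed by the hardware
(x87 is component 0 through Hi16_ZMM at 7, with Processor Trace at 8
and PKRU at 9), so the placeholder keeps the enum in step with that
numbering.  A sketch of where the enum ends up once the later PKRU
patch in this series is applied (the numbering comments are added
here for illustration only):

	enum xfeature {
		XFEATURE_FP,				/* 0: x87 */
		XFEATURE_SSE,				/* 1: SSE */
		XFEATURE_YMM,				/* 2: AVX */
		XFEATURE_BNDREGS,			/* 3: MPX */
		XFEATURE_BNDCSR,			/* 4: MPX */
		XFEATURE_OPMASK,			/* 5: AVX-512 */
		XFEATURE_ZMM_Hi256,			/* 6: AVX-512 */
		XFEATURE_Hi16_ZMM,			/* 7: AVX-512 */
		XFEATURE_PT_UNIMPLEMENTED_SO_FAR,	/* 8: placeholder */
		XFEATURE_PKRU,				/* 9: no '= 9' needed */

		XFEATURE_MAX,
	};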

^ permalink raw reply	[flat|nested] 145+ messages in thread

* [PATCH 03/34] x86, pkeys: Add Kconfig option
  2015-12-04  1:14 ` Dave Hansen
@ 2015-12-04  1:14   ` Dave Hansen
  -1 siblings, 0 replies; 145+ messages in thread
From: Dave Hansen @ 2015-12-04  1:14 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-mm, x86, Dave Hansen, dave.hansen


From: Dave Hansen <dave.hansen@linux.intel.com>

I don't have a strong opinion on whether we need a Kconfig prompt
or not.  Protection Keys has relatively little code associated
with it, and it is not a heavyweight feature to keep enabled.
However, I can imagine that folks would still appreciate being
able to disable it.

Note that, with disabled-features.h, the checks in the code
for protection keys are always the same:

	cpu_has(c, X86_FEATURE_PKU)

With the config option disabled, this essentially turns into an
#ifdef.

We will hide the prompt for now.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
---

 b/arch/x86/Kconfig |    4 ++++
 1 file changed, 4 insertions(+)

diff -puN arch/x86/Kconfig~pkeys-01-kconfig arch/x86/Kconfig
--- a/arch/x86/Kconfig~pkeys-01-kconfig	2015-12-03 16:21:19.440390755 -0800
+++ b/arch/x86/Kconfig	2015-12-03 16:21:19.444390937 -0800
@@ -1680,6 +1680,10 @@ config X86_INTEL_MPX
 
 	  If unsure, say N.
 
+config X86_INTEL_MEMORY_PROTECTION_KEYS
+	def_bool y
+	depends on CPU_SUP_INTEL && X86_64
+
 config EFI
 	bool "EFI runtime service support"
 	depends on ACPI
_
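
To make the "essentially an #ifdef" point concrete, here is a
simplified illustration (pku_enabled() and setup_pku() are
hypothetical names, not the kernel's macros; see cpufeature.h and
disabled-features.h for the real ones):

	/*
	 * With the config option off, a later patch puts
	 * X86_FEATURE_PKU into a DISABLED_MASK* word.  The feature
	 * bit is a compile-time constant, so a check like this
	 * folds to 0 and the guarded code is discarded entirely:
	 */
	#define pku_enabled(c)						\
		(__builtin_constant_p(X86_FEATURE_PKU) &&		\
		 DISABLED_MASK_BIT_SET(X86_FEATURE_PKU) ? 0 :		\
		 cpu_has(c, X86_FEATURE_PKU))

	if (pku_enabled(c))		/* compiles to if (0) */
		setup_pku(c);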

^ permalink raw reply	[flat|nested] 145+ messages in thread

* [PATCH 04/34] x86, pkeys: cpuid bit definition
  2015-12-04  1:14 ` Dave Hansen
@ 2015-12-04  1:14   ` Dave Hansen
  -1 siblings, 0 replies; 145+ messages in thread
From: Dave Hansen @ 2015-12-04  1:14 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-mm, x86, Dave Hansen, dave.hansen


From: Dave Hansen <dave.hansen@linux.intel.com>

There are two CPUID bits for protection keys.  One indicates whether
the CPU supports the feature, and the other will appear set once
the OS enables protection keys.  Specifically:

	Bit 04: OSPKE. If 1, OS has set CR4.PKE to enable
	Protection keys (and the RDPKRU/WRPKRU instructions)

This is because userspace cannot see CR4 contents, but it can see
CPUID contents.

X86_FEATURE_PKU is referred to as "PKU" in the hardware documentation:

	CPUID.(EAX=07H,ECX=0H):ECX.PKU [bit 3]

X86_FEATURE_OSPKE is "OSPKE":

	CPUID.(EAX=07H,ECX=0H):ECX.OSPKE [bit 4]

These are the first CPU features which need to look at the
ECX word in CPUID leaf 0x7, so this patch also includes
fetching that word into the cpuinfo->x86_capability[] array.

Add it to the disabled-features mask when its config option is
off.  Even though we are not using it here, we also extend the
REQUIRED_MASK_BIT_SET() macro to keep it mirroring the
DISABLED_MASK_BIT_SET() version.

This means that in almost all code, you should use:

	cpu_has(c, X86_FEATURE_PKU)

and *not* the CONFIG option.
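
As an illustration of the userspace-visible side, a minimal sketch
of probing these bits with the CPUID instruction (example code, not
part of the patch):

	/* Read CPUID.(EAX=07H,ECX=0H):ECX from userspace. */
	static unsigned int cpuid_7_0_ecx(void)
	{
		unsigned int eax = 7, ebx, ecx = 0, edx;

		asm volatile("cpuid"
			     : "+a" (eax), "=b" (ebx),
			       "+c" (ecx), "=d" (edx));
		return ecx;
	}

	/*
	 * (cpuid_7_0_ecx() >> 3) & 1  -- PKU:   CPU has the feature
	 * (cpuid_7_0_ecx() >> 4) & 1  -- OSPKE: OS has set CR4.PKE
	 */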

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
---

 b/arch/x86/include/asm/cpufeature.h        |   56 ++++++++++++++++++-----------
 b/arch/x86/include/asm/disabled-features.h |   13 ++++++
 b/arch/x86/include/asm/required-features.h |    5 ++
 b/arch/x86/kernel/cpu/common.c             |    1 
 4 files changed, 54 insertions(+), 21 deletions(-)

diff -puN arch/x86/include/asm/cpufeature.h~pkeys-01-cpuid arch/x86/include/asm/cpufeature.h
--- a/arch/x86/include/asm/cpufeature.h~pkeys-01-cpuid	2015-12-03 16:21:19.852409441 -0800
+++ b/arch/x86/include/asm/cpufeature.h	2015-12-03 16:21:19.860409804 -0800
@@ -12,7 +12,7 @@
 #include <asm/disabled-features.h>
 #endif
 
-#define NCAPINTS	14	/* N 32-bit words worth of info */
+#define NCAPINTS	15	/* N 32-bit words worth of info */
 #define NBUGINTS	1	/* N 32-bit bug flags */
 
 /*
@@ -258,6 +258,10 @@
 /* AMD-defined CPU features, CPUID level 0x80000008 (ebx), word 13 */
 #define X86_FEATURE_CLZERO	(13*32+0) /* CLZERO instruction */
 
+/* Intel-defined CPU features, CPUID level 0x00000007:0 (ecx), word 14 */
+#define X86_FEATURE_PKU		(14*32+ 3) /* Protection Keys for Userspace */
+#define X86_FEATURE_OSPKE	(14*32+ 4) /* OS Protection Keys Enable */
+
 /*
  * BUG word(s)
  */
@@ -298,28 +302,38 @@ extern const char * const x86_bug_flags[
 	 test_bit(bit, (unsigned long *)((c)->x86_capability))
 
 #define REQUIRED_MASK_BIT_SET(bit)					\
-	 ( (((bit)>>5)==0 && (1UL<<((bit)&31) & REQUIRED_MASK0)) ||	\
-	   (((bit)>>5)==1 && (1UL<<((bit)&31) & REQUIRED_MASK1)) ||	\
-	   (((bit)>>5)==2 && (1UL<<((bit)&31) & REQUIRED_MASK2)) ||	\
-	   (((bit)>>5)==3 && (1UL<<((bit)&31) & REQUIRED_MASK3)) ||	\
-	   (((bit)>>5)==4 && (1UL<<((bit)&31) & REQUIRED_MASK4)) ||	\
-	   (((bit)>>5)==5 && (1UL<<((bit)&31) & REQUIRED_MASK5)) ||	\
-	   (((bit)>>5)==6 && (1UL<<((bit)&31) & REQUIRED_MASK6)) ||	\
-	   (((bit)>>5)==7 && (1UL<<((bit)&31) & REQUIRED_MASK7)) ||	\
-	   (((bit)>>5)==8 && (1UL<<((bit)&31) & REQUIRED_MASK8)) ||	\
-	   (((bit)>>5)==9 && (1UL<<((bit)&31) & REQUIRED_MASK9)) )
+	 ( (((bit)>>5)==0  && (1UL<<((bit)&31) & REQUIRED_MASK0 )) ||	\
+	   (((bit)>>5)==1  && (1UL<<((bit)&31) & REQUIRED_MASK1 )) ||	\
+	   (((bit)>>5)==2  && (1UL<<((bit)&31) & REQUIRED_MASK2 )) ||	\
+	   (((bit)>>5)==3  && (1UL<<((bit)&31) & REQUIRED_MASK3 )) ||	\
+	   (((bit)>>5)==4  && (1UL<<((bit)&31) & REQUIRED_MASK4 )) ||	\
+	   (((bit)>>5)==5  && (1UL<<((bit)&31) & REQUIRED_MASK5 )) ||	\
+	   (((bit)>>5)==6  && (1UL<<((bit)&31) & REQUIRED_MASK6 )) ||	\
+	   (((bit)>>5)==7  && (1UL<<((bit)&31) & REQUIRED_MASK7 )) ||	\
+	   (((bit)>>5)==8  && (1UL<<((bit)&31) & REQUIRED_MASK8 )) ||	\
+	   (((bit)>>5)==9  && (1UL<<((bit)&31) & REQUIRED_MASK9 )) ||	\
+	   (((bit)>>5)==10 && (1UL<<((bit)&31) & REQUIRED_MASK10)) ||	\
+	   (((bit)>>5)==11 && (1UL<<((bit)&31) & REQUIRED_MASK11)) ||	\
+	   (((bit)>>5)==12 && (1UL<<((bit)&31) & REQUIRED_MASK12)) ||	\
+	   (((bit)>>5)==13 && (1UL<<((bit)&31) & REQUIRED_MASK13)) ||	\
+	   (((bit)>>5)==14 && (1UL<<((bit)&31) & REQUIRED_MASK14)) )
 
 #define DISABLED_MASK_BIT_SET(bit)					\
-	 ( (((bit)>>5)==0 && (1UL<<((bit)&31) & DISABLED_MASK0)) ||	\
-	   (((bit)>>5)==1 && (1UL<<((bit)&31) & DISABLED_MASK1)) ||	\
-	   (((bit)>>5)==2 && (1UL<<((bit)&31) & DISABLED_MASK2)) ||	\
-	   (((bit)>>5)==3 && (1UL<<((bit)&31) & DISABLED_MASK3)) ||	\
-	   (((bit)>>5)==4 && (1UL<<((bit)&31) & DISABLED_MASK4)) ||	\
-	   (((bit)>>5)==5 && (1UL<<((bit)&31) & DISABLED_MASK5)) ||	\
-	   (((bit)>>5)==6 && (1UL<<((bit)&31) & DISABLED_MASK6)) ||	\
-	   (((bit)>>5)==7 && (1UL<<((bit)&31) & DISABLED_MASK7)) ||	\
-	   (((bit)>>5)==8 && (1UL<<((bit)&31) & DISABLED_MASK8)) ||	\
-	   (((bit)>>5)==9 && (1UL<<((bit)&31) & DISABLED_MASK9)) )
+	 ( (((bit)>>5)==0  && (1UL<<((bit)&31) & DISABLED_MASK0 )) ||	\
+	   (((bit)>>5)==1  && (1UL<<((bit)&31) & DISABLED_MASK1 )) ||	\
+	   (((bit)>>5)==2  && (1UL<<((bit)&31) & DISABLED_MASK2 )) ||	\
+	   (((bit)>>5)==3  && (1UL<<((bit)&31) & DISABLED_MASK3 )) ||	\
+	   (((bit)>>5)==4  && (1UL<<((bit)&31) & DISABLED_MASK4 )) ||	\
+	   (((bit)>>5)==5  && (1UL<<((bit)&31) & DISABLED_MASK5 )) ||	\
+	   (((bit)>>5)==6  && (1UL<<((bit)&31) & DISABLED_MASK6 )) ||	\
+	   (((bit)>>5)==7  && (1UL<<((bit)&31) & DISABLED_MASK7 )) ||	\
+	   (((bit)>>5)==8  && (1UL<<((bit)&31) & DISABLED_MASK8 )) ||	\
+	   (((bit)>>5)==9  && (1UL<<((bit)&31) & DISABLED_MASK9 )) ||	\
+	   (((bit)>>5)==10 && (1UL<<((bit)&31) & DISABLED_MASK10)) ||	\
+	   (((bit)>>5)==11 && (1UL<<((bit)&31) & DISABLED_MASK11)) ||	\
+	   (((bit)>>5)==12 && (1UL<<((bit)&31) & DISABLED_MASK12)) ||	\
+	   (((bit)>>5)==13 && (1UL<<((bit)&31) & DISABLED_MASK13)) ||	\
+	   (((bit)>>5)==14 && (1UL<<((bit)&31) & DISABLED_MASK14)) )
 
 #define cpu_has(c, bit)							\
 	(__builtin_constant_p(bit) && REQUIRED_MASK_BIT_SET(bit) ? 1 :	\
diff -puN arch/x86/include/asm/disabled-features.h~pkeys-01-cpuid arch/x86/include/asm/disabled-features.h
--- a/arch/x86/include/asm/disabled-features.h~pkeys-01-cpuid	2015-12-03 16:21:19.854409532 -0800
+++ b/arch/x86/include/asm/disabled-features.h	2015-12-03 16:21:19.861409849 -0800
@@ -28,6 +28,14 @@
 # define DISABLE_CENTAUR_MCR	0
 #endif /* CONFIG_X86_64 */
 
+#ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS
+# define DISABLE_PKU		(1<<(X86_FEATURE_PKU))
+# define DISABLE_OSPKE		(1<<(X86_FEATURE_OSPKE))
+#else
+# define DISABLE_PKU		0
+# define DISABLE_OSPKE		0
+#endif /* CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS */
+
 /*
  * Make sure to add features to the correct mask
  */
@@ -41,5 +49,10 @@
 #define DISABLED_MASK7	0
 #define DISABLED_MASK8	0
 #define DISABLED_MASK9	(DISABLE_MPX)
+#define DISABLED_MASK10	0
+#define DISABLED_MASK11	0
+#define DISABLED_MASK12	0
+#define DISABLED_MASK13	0
+#define DISABLED_MASK14	(DISABLE_PKU|DISABLE_OSPKE)
 
 #endif /* _ASM_X86_DISABLED_FEATURES_H */
diff -puN arch/x86/include/asm/required-features.h~pkeys-01-cpuid arch/x86/include/asm/required-features.h
--- a/arch/x86/include/asm/required-features.h~pkeys-01-cpuid	2015-12-03 16:21:19.855409577 -0800
+++ b/arch/x86/include/asm/required-features.h	2015-12-03 16:21:19.861409849 -0800
@@ -92,5 +92,10 @@
 #define REQUIRED_MASK7	0
 #define REQUIRED_MASK8	0
 #define REQUIRED_MASK9	0
+#define REQUIRED_MASK10	0
+#define REQUIRED_MASK11	0
+#define REQUIRED_MASK12	0
+#define REQUIRED_MASK13	0
+#define REQUIRED_MASK14	0
 
 #endif /* _ASM_X86_REQUIRED_FEATURES_H */
diff -puN arch/x86/kernel/cpu/common.c~pkeys-01-cpuid arch/x86/kernel/cpu/common.c
--- a/arch/x86/kernel/cpu/common.c~pkeys-01-cpuid	2015-12-03 16:21:19.857409668 -0800
+++ b/arch/x86/kernel/cpu/common.c	2015-12-03 16:21:19.861409849 -0800
@@ -619,6 +619,7 @@ void get_cpu_cap(struct cpuinfo_x86 *c)
 		cpuid_count(0x00000007, 0, &eax, &ebx, &ecx, &edx);
 
 		c->x86_capability[9] = ebx;
+		c->x86_capability[14] = ecx;
 	}
 
 	/* Extended state features: level 0x0000000d */
_

^ permalink raw reply	[flat|nested] 145+ messages in thread

* [PATCH 05/34] x86, pkeys: define new CR4 bit
  2015-12-04  1:14 ` Dave Hansen
@ 2015-12-04  1:14   ` Dave Hansen
  0 siblings, 0 replies; 145+ messages in thread
From: Dave Hansen @ 2015-12-04  1:14 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-mm, x86, Dave Hansen, dave.hansen


From: Dave Hansen <dave.hansen@linux.intel.com>

There is a new bit in CR4 for enabling protection keys.  We
will actually enable it later in the series.
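
For context, the eventual enablement looks roughly like this sketch
(the real call site arrives in a later patch of this series):

	/* sketch: only after X86_FEATURE_PKU has been verified */
	if (cpu_has(c, X86_FEATURE_PKU))
		cr4_set_bits(X86_CR4_PKE);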

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
---

 b/arch/x86/include/uapi/asm/processor-flags.h |    2 ++
 1 file changed, 2 insertions(+)

diff -puN arch/x86/include/uapi/asm/processor-flags.h~pkeys-02-cr4 arch/x86/include/uapi/asm/processor-flags.h
--- a/arch/x86/include/uapi/asm/processor-flags.h~pkeys-02-cr4	2015-12-03 16:21:20.345431800 -0800
+++ b/arch/x86/include/uapi/asm/processor-flags.h	2015-12-03 16:21:20.348431936 -0800
@@ -118,6 +118,8 @@
 #define X86_CR4_SMEP		_BITUL(X86_CR4_SMEP_BIT)
 #define X86_CR4_SMAP_BIT	21 /* enable SMAP support */
 #define X86_CR4_SMAP		_BITUL(X86_CR4_SMAP_BIT)
+#define X86_CR4_PKE_BIT		22 /* enable Protection Keys support */
+#define X86_CR4_PKE		_BITUL(X86_CR4_PKE_BIT)
 
 /*
  * x86-64 Task Priority Register, CR8
_

^ permalink raw reply	[flat|nested] 145+ messages in thread

* [PATCH 06/34] x86, pkeys: add PKRU xsave fields and data structure(s)
  2015-12-04  1:14 ` Dave Hansen
@ 2015-12-04  1:14   ` Dave Hansen
  0 siblings, 0 replies; 145+ messages in thread
From: Dave Hansen @ 2015-12-04  1:14 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-mm, x86, Dave Hansen, dave.hansen


From: Dave Hansen <dave.hansen@linux.intel.com>

The protection keys register (PKRU) is saved and restored using
xsave.  Define the data structure that we will use to access it
inside the xsave buffer.

Note that we also have to widen the printk format for the xsave
feature masks: this is feature 0x200, and the format previously
printed only two hex digits.
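
Once OSPKE is set, PKRU is also directly readable from user space.
A hedged user-space sketch using the raw RDPKRU encoding (this
faults with SIGILL unless CR4.PKE has been set by the kernel):

	#include <stdio.h>

	static inline unsigned int rdpkru(void)
	{
		unsigned int pkru, edx;

		/* RDPKRU: opcode 0f 01 ee; requires ECX=0 */
		asm volatile(".byte 0x0f,0x01,0xee"
			     : "=a" (pkru), "=d" (edx)
			     : "c" (0));
		return pkru;
	}

	int main(void)
	{
		/* two bits per key: AD (access disable), WD (write disable) */
		printf("PKRU = 0x%08x\n", rdpkru());
		return 0;
	}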

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
---

 b/arch/x86/include/asm/fpu/types.h  |   11 +++++++++++
 b/arch/x86/include/asm/fpu/xstate.h |    4 +++-
 b/arch/x86/kernel/fpu/xstate.c      |    7 ++++++-
 3 files changed, 20 insertions(+), 2 deletions(-)

diff -puN arch/x86/include/asm/fpu/types.h~pkeys-03-xsave arch/x86/include/asm/fpu/types.h
--- a/arch/x86/include/asm/fpu/types.h~pkeys-03-xsave	2015-12-03 16:21:20.747450032 -0800
+++ b/arch/x86/include/asm/fpu/types.h	2015-12-03 16:21:20.753450304 -0800
@@ -109,6 +109,7 @@ enum xfeature {
 	XFEATURE_ZMM_Hi256,
 	XFEATURE_Hi16_ZMM,
 	XFEATURE_PT_UNIMPLEMENTED_SO_FAR,
+	XFEATURE_PKRU,
 
 	XFEATURE_MAX,
 };
@@ -121,6 +122,7 @@ enum xfeature {
 #define XFEATURE_MASK_OPMASK		(1 << XFEATURE_OPMASK)
 #define XFEATURE_MASK_ZMM_Hi256		(1 << XFEATURE_ZMM_Hi256)
 #define XFEATURE_MASK_Hi16_ZMM		(1 << XFEATURE_Hi16_ZMM)
+#define XFEATURE_MASK_PKRU		(1 << XFEATURE_PKRU)
 
 #define XFEATURE_MASK_FPSSE		(XFEATURE_MASK_FP | XFEATURE_MASK_SSE)
 #define XFEATURE_MASK_AVX512		(XFEATURE_MASK_OPMASK \
@@ -213,6 +215,15 @@ struct avx_512_hi16_state {
 	struct reg_512_bit		hi16_zmm[16];
 } __packed;
 
+/*
+ * State component 9: 32-bit PKRU register.  The state is
+ * 8 bytes long but only 4 bytes are used currently.
+ */
+struct pkru_state {
+	u32				pkru;
+	u32				pad;
+} __packed;
+
 struct xstate_header {
 	u64				xfeatures;
 	u64				xcomp_bv;
diff -puN arch/x86/include/asm/fpu/xstate.h~pkeys-03-xsave arch/x86/include/asm/fpu/xstate.h
--- a/arch/x86/include/asm/fpu/xstate.h~pkeys-03-xsave	2015-12-03 16:21:20.748450077 -0800
+++ b/arch/x86/include/asm/fpu/xstate.h	2015-12-03 16:21:20.754450349 -0800
@@ -27,7 +27,9 @@
 				 XFEATURE_MASK_Hi16_ZMM)
 
 /* Supported features which require eager state saving */
-#define XFEATURE_MASK_EAGER	(XFEATURE_MASK_BNDREGS | XFEATURE_MASK_BNDCSR)
+#define XFEATURE_MASK_EAGER	(XFEATURE_MASK_BNDREGS | \
+				 XFEATURE_MASK_BNDCSR | \
+				 XFEATURE_MASK_PKRU)
 
 /* All currently supported features */
 #define XCNTXT_MASK	(XFEATURE_MASK_LAZY | XFEATURE_MASK_EAGER)
diff -puN arch/x86/kernel/fpu/xstate.c~pkeys-03-xsave arch/x86/kernel/fpu/xstate.c
--- a/arch/x86/kernel/fpu/xstate.c~pkeys-03-xsave	2015-12-03 16:21:20.750450168 -0800
+++ b/arch/x86/kernel/fpu/xstate.c	2015-12-03 16:21:20.754450349 -0800
@@ -29,6 +29,8 @@ static const char *xfeature_names[] =
 	"AVX-512 Hi256"			,
 	"AVX-512 ZMM_Hi256"		,
 	"Processor Trace (unused)"	,
+	"Protection Keys User registers",
+	"unknown xstate feature"	,
 };
 
 /*
@@ -57,6 +59,7 @@ void fpu__xstate_clear_all_cpu_caps(void
 	setup_clear_cpu_cap(X86_FEATURE_AVX512ER);
 	setup_clear_cpu_cap(X86_FEATURE_AVX512CD);
 	setup_clear_cpu_cap(X86_FEATURE_MPX);
+	setup_clear_cpu_cap(X86_FEATURE_PKU);
 }
 
 /*
@@ -235,7 +238,7 @@ static void __init print_xstate_feature(
 	const char *feature_name;
 
 	if (cpu_has_xfeatures(xstate_mask, &feature_name))
-		pr_info("x86/fpu: Supporting XSAVE feature 0x%02Lx: '%s'\n", xstate_mask, feature_name);
+		pr_info("x86/fpu: Supporting XSAVE feature 0x%03Lx: '%s'\n", xstate_mask, feature_name);
 }
 
 /*
@@ -251,6 +254,7 @@ static void __init print_xstate_features
 	print_xstate_feature(XFEATURE_MASK_OPMASK);
 	print_xstate_feature(XFEATURE_MASK_ZMM_Hi256);
 	print_xstate_feature(XFEATURE_MASK_Hi16_ZMM);
+	print_xstate_feature(XFEATURE_MASK_PKRU);
 }
 
 /*
@@ -467,6 +471,7 @@ static void check_xstate_against_struct(
 	XCHECK_SZ(sz, nr, XFEATURE_OPMASK,    struct avx_512_opmask_state);
 	XCHECK_SZ(sz, nr, XFEATURE_ZMM_Hi256, struct avx_512_zmm_uppers_state);
 	XCHECK_SZ(sz, nr, XFEATURE_Hi16_ZMM,  struct avx_512_hi16_state);
+	XCHECK_SZ(sz, nr, XFEATURE_PKRU,      struct pkru_state);
 
 	/*
 	 * Make *SURE* to add any feature numbers in below if
_

^ permalink raw reply	[flat|nested] 145+ messages in thread

* [PATCH 07/34] x86, pkeys: PTE bits for storing protection key
  2015-12-04  1:14 ` Dave Hansen
@ 2015-12-04  1:14   ` Dave Hansen
  0 siblings, 0 replies; 145+ messages in thread
From: Dave Hansen @ 2015-12-04  1:14 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-mm, x86, Dave Hansen, dave.hansen


From: Dave Hansen <dave.hansen@linux.intel.com>

Previous documentation has referred to these 4 bits as "ignored".
That means that software could have made use of them.  But, as
far as I know, the kernel never used them.

They are still ignored when protection keys are not enabled, so
they could theoretically still get used for software purposes.

We also implement "empty" versions so that code that references
them can be optimized away by the compiler when the config
option is not enabled.
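
A small stand-alone model (not kernel code; names are illustrative)
of how a 4-bit key would be recovered from pte bits 59..62:

	#include <stdio.h>
	#include <stdint.h>

	#define PKEY_BIT0	59	/* lowest of the four pkey bits */

	static unsigned int pte_pkey(uint64_t pte)
	{
		return (unsigned int)((pte >> PKEY_BIT0) & 0xf);
	}

	int main(void)
	{
		uint64_t pte = (uint64_t)0x5 << PKEY_BIT0;	/* key 5 */

		printf("pkey = %u\n", pte_pkey(pte));
		return 0;
	}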

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
---

 b/arch/x86/include/asm/pgtable_types.h |   17 ++++++++++++++++-
 1 file changed, 16 insertions(+), 1 deletion(-)

diff -puN arch/x86/include/asm/pgtable_types.h~pkeys-04-ptebits arch/x86/include/asm/pgtable_types.h
--- a/arch/x86/include/asm/pgtable_types.h~pkeys-04-ptebits	2015-12-03 16:21:21.207470895 -0800
+++ b/arch/x86/include/asm/pgtable_types.h	2015-12-03 16:21:21.211471076 -0800
@@ -25,7 +25,11 @@
 #define _PAGE_BIT_SPLITTING	_PAGE_BIT_SOFTW2 /* only valid on a PSE pmd */
 #define _PAGE_BIT_HIDDEN	_PAGE_BIT_SOFTW3 /* hidden by kmemcheck */
 #define _PAGE_BIT_SOFT_DIRTY	_PAGE_BIT_SOFTW3 /* software dirty tracking */
-#define _PAGE_BIT_NX           63       /* No execute: only valid after cpuid check */
+#define _PAGE_BIT_PKEY_BIT0	59       /* Protection Keys, bit 1/4 */
+#define _PAGE_BIT_PKEY_BIT1	60       /* Protection Keys, bit 2/4 */
+#define _PAGE_BIT_PKEY_BIT2	61       /* Protection Keys, bit 3/4 */
+#define _PAGE_BIT_PKEY_BIT3	62       /* Protection Keys, bit 4/4 */
+#define _PAGE_BIT_NX		63       /* No execute: only valid after cpuid check */
 
 /* If _PAGE_BIT_PRESENT is clear, we use these: */
 /* - if the user mapped it with PROT_NONE; pte_present gives true */
@@ -47,6 +51,17 @@
 #define _PAGE_SPECIAL	(_AT(pteval_t, 1) << _PAGE_BIT_SPECIAL)
 #define _PAGE_CPA_TEST	(_AT(pteval_t, 1) << _PAGE_BIT_CPA_TEST)
 #define _PAGE_SPLITTING	(_AT(pteval_t, 1) << _PAGE_BIT_SPLITTING)
+#ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS
+#define _PAGE_PKEY_BIT0	(_AT(pteval_t, 1) << _PAGE_BIT_PKEY_BIT0)
+#define _PAGE_PKEY_BIT1	(_AT(pteval_t, 1) << _PAGE_BIT_PKEY_BIT1)
+#define _PAGE_PKEY_BIT2	(_AT(pteval_t, 1) << _PAGE_BIT_PKEY_BIT2)
+#define _PAGE_PKEY_BIT3	(_AT(pteval_t, 1) << _PAGE_BIT_PKEY_BIT3)
+#else
+#define _PAGE_PKEY_BIT0	(_AT(pteval_t, 0))
+#define _PAGE_PKEY_BIT1	(_AT(pteval_t, 0))
+#define _PAGE_PKEY_BIT2	(_AT(pteval_t, 0))
+#define _PAGE_PKEY_BIT3	(_AT(pteval_t, 0))
+#endif
 #define __HAVE_ARCH_PTE_SPECIAL
 
 #ifdef CONFIG_KMEMCHECK
_

^ permalink raw reply	[flat|nested] 145+ messages in thread

* [PATCH 08/34] x86, pkeys: new page fault error code bit: PF_PK
  2015-12-04  1:14 ` Dave Hansen
@ 2015-12-04  1:14   ` Dave Hansen
  0 siblings, 0 replies; 145+ messages in thread
From: Dave Hansen @ 2015-12-04  1:14 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-mm, x86, Dave Hansen, dave.hansen


From: Dave Hansen <dave.hansen@linux.intel.com>

Note: "PK" is how the Intel SDM refers to this bit, so we also
use that nomenclature.

This only defines the bit, it does not plumb it anywhere to be
handled.
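
For reference, a stand-alone model of decoding an error code carrying
the new bit (the enum mirrors the bit layout documented in the diff
below; the example value is made up):

	#include <stdio.h>

	enum { PF_PROT = 1 << 0, PF_WRITE = 1 << 1, PF_USER = 1 << 2,
	       PF_RSVD = 1 << 3, PF_INSTR = 1 << 4, PF_PK = 1 << 5 };

	int main(void)
	{
		unsigned long error_code = PF_PROT | PF_USER | PF_PK;

		if ((error_code & PF_PK) && (error_code & PF_USER))
			printf("user-mode pkey fault (0x%lx)\n", error_code);
		return 0;
	}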

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
---

 b/arch/x86/mm/fault.c |    8 ++++++++
 1 file changed, 8 insertions(+)

diff -puN arch/x86/mm/fault.c~pkeys-05-pfec arch/x86/mm/fault.c
--- a/arch/x86/mm/fault.c~pkeys-05-pfec	2015-12-03 16:21:21.619489580 -0800
+++ b/arch/x86/mm/fault.c	2015-12-03 16:21:21.622489716 -0800
@@ -33,6 +33,7 @@
  *   bit 2 ==	 0: kernel-mode access	1: user-mode access
  *   bit 3 ==				1: use of reserved bit detected
  *   bit 4 ==				1: fault was an instruction fetch
+ *   bit 5 ==				1: protection keys block access
  */
 enum x86_pf_error_code {
 
@@ -41,6 +42,7 @@ enum x86_pf_error_code {
 	PF_USER		=		1 << 2,
 	PF_RSVD		=		1 << 3,
 	PF_INSTR	=		1 << 4,
+	PF_PK		=		1 << 5,
 };
 
 /*
@@ -916,6 +918,12 @@ static int spurious_fault_check(unsigned
 
 	if ((error_code & PF_INSTR) && !pte_exec(*pte))
 		return 0;
+	/*
+	 * Note: We do not do lazy flushing on protection key
+	 * changes, so no spurious fault will ever set PF_PK.
+	 */
+	if ((error_code & PF_PK))
+		return 1;
 
 	return 1;
 }
_

^ permalink raw reply	[flat|nested] 145+ messages in thread

* [PATCH 09/34] x86, pkeys: store protection in high VMA flags
  2015-12-04  1:14 ` Dave Hansen
@ 2015-12-04  1:14   ` Dave Hansen
  0 siblings, 0 replies; 145+ messages in thread
From: Dave Hansen @ 2015-12-04  1:14 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-mm, x86, Dave Hansen, dave.hansen


From: Dave Hansen <dave.hansen@linux.intel.com>

vma->vm_flags is an 'unsigned long', so it has space for 32 flags
on 32-bit architectures.  The high 32 bits are unused on 64-bit
platforms.  We've steered away from using the unused high VMA
bits for things because we would have difficulty supporting them
on 32-bit.

Protection Keys are not available in 32-bit mode, so there is
no concern about supporting this feature in 32-bit mode or on
32-bit CPUs.

This patch carves out 4 bits from the high half of
vma->vm_flags and allows architectures to set a config option
to make them available.

Sparse complains about these constants unless we explicitly
call them "UL".

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
---

 b/arch/x86/Kconfig   |    1 +
 b/include/linux/mm.h |    7 +++++++
 b/mm/Kconfig         |    3 +++
 3 files changed, 11 insertions(+)

diff -puN arch/x86/Kconfig~pkeys-07-eat-high-vma-flags arch/x86/Kconfig
--- a/arch/x86/Kconfig~pkeys-07-eat-high-vma-flags	2015-12-03 16:21:22.042508764 -0800
+++ b/arch/x86/Kconfig	2015-12-03 16:21:22.050509127 -0800
@@ -152,6 +152,7 @@ config X86
 	select VIRT_TO_BUS
 	select X86_DEV_DMA_OPS			if X86_64
 	select X86_FEATURE_NAMES		if PROC_FS
+	select ARCH_USES_HIGH_VMA_FLAGS		if X86_INTEL_MEMORY_PROTECTION_KEYS
 
 config INSTRUCTION_DECODER
 	def_bool y
diff -puN include/linux/mm.h~pkeys-07-eat-high-vma-flags include/linux/mm.h
--- a/include/linux/mm.h~pkeys-07-eat-high-vma-flags	2015-12-03 16:21:22.044508855 -0800
+++ b/include/linux/mm.h	2015-12-03 16:21:22.051509173 -0800
@@ -158,6 +158,13 @@ extern unsigned int kobjsize(const void
 #define VM_NOHUGEPAGE	0x40000000	/* MADV_NOHUGEPAGE marked this vma */
 #define VM_MERGEABLE	0x80000000	/* KSM may merge identical pages */
 
+#ifdef CONFIG_ARCH_USES_HIGH_VMA_FLAGS
+#define VM_HIGH_ARCH_0  0x100000000UL	/* bit only usable on 64-bit architectures */
+#define VM_HIGH_ARCH_1  0x200000000UL	/* bit only usable on 64-bit architectures */
+#define VM_HIGH_ARCH_2  0x400000000UL	/* bit only usable on 64-bit architectures */
+#define VM_HIGH_ARCH_3  0x800000000UL	/* bit only usable on 64-bit architectures */
+#endif /* CONFIG_ARCH_USES_HIGH_VMA_FLAGS */
+
 #if defined(CONFIG_X86)
 # define VM_PAT		VM_ARCH_1	/* PAT reserves whole VMA at once (x86) */
 #elif defined(CONFIG_PPC)
diff -puN mm/Kconfig~pkeys-07-eat-high-vma-flags mm/Kconfig
--- a/mm/Kconfig~pkeys-07-eat-high-vma-flags	2015-12-03 16:21:22.046508946 -0800
+++ b/mm/Kconfig	2015-12-03 16:21:22.051509173 -0800
@@ -668,3 +668,6 @@ config ZONE_DEVICE
 
 config FRAME_VECTOR
 	bool
+
+config ARCH_USES_HIGH_VMA_FLAGS
+	bool
_

^ permalink raw reply	[flat|nested] 145+ messages in thread

* [PATCH 10/34] x86, pkeys: arch-specific protection bits
  2015-12-04  1:14 ` Dave Hansen
@ 2015-12-04  1:14   ` Dave Hansen
  0 siblings, 0 replies; 145+ messages in thread
From: Dave Hansen @ 2015-12-04  1:14 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-mm, x86, Dave Hansen, dave.hansen


From: Dave Hansen <dave.hansen@linux.intel.com>

Lots of things seem to do:

        vma->vm_page_prot = vm_get_page_prot(flags);

and the ptes get created right from things we pull out
of ->vm_page_prot.  So it is very convenient if we can
store the protection key in flags and vm_page_prot, just
like the existing permission bits (_PAGE_RW/PRESENT).  It
greatly reduces the amount of plumbing and arch-specific
hacking we have to do in generic code.

This also takes the new PROT_PKEY{0,1,2,3} flags and
turns *those* in to VM_ flags for vma->vm_flags.

The protection key values are stored in 4 places:
	1. "prot" argument to system calls
	2. vma->vm_flags, filled from the mmap "prot"
	3. vma->vm_page_prot, filled from vma->vm_flags
	4. the PTE itself.

The pseudocode for these four steps is as follows:

	mmap(PROT_PKEY*)
	vma->vm_flags 	  = ... | arch_calc_vm_prot_bits(mmap_prot);
	vma->vm_page_prot = ... | arch_vm_get_page_prot(vma->vm_flags);
	pte = pfn | vma->vm_page_prot

Note that this provides a new definition for x86:

	arch_vm_get_page_prot()
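
A stand-alone model of that translation (illustrative names and
shifts: VM_PKEY_BIT0..3 are vm_flags bits 32..35, and
_PAGE_PKEY_BIT0..3 are pte bits 59..62):

	#include <stdio.h>
	#include <stdint.h>

	#define VM_PKEY_SHIFT	32	/* VM_HIGH_ARCH_0 */
	#define PTE_PKEY_SHIFT	59	/* _PAGE_BIT_PKEY_BIT0 */

	static uint64_t pkey_prot_bits(unsigned long vm_flags)
	{
		uint64_t pkey = (vm_flags >> VM_PKEY_SHIFT) & 0xf;

		return pkey << PTE_PKEY_SHIFT;
	}

	int main(void)
	{
		unsigned long vm_flags = 0x9UL << VM_PKEY_SHIFT; /* key 9 */

		printf("pte pkey bits = %#llx\n",
		       (unsigned long long)pkey_prot_bits(vm_flags));
		return 0;
	}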

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
---

 b/arch/x86/include/asm/mmu_context.h   |   20 ++++++++++++++++++++
 b/arch/x86/include/asm/pgtable_types.h |   12 ++++++++++--
 b/arch/x86/include/uapi/asm/mman.h     |   16 ++++++++++++++++
 b/include/linux/mm.h                   |    6 ++++++
 4 files changed, 52 insertions(+), 2 deletions(-)

diff -puN arch/x86/include/asm/mmu_context.h~pkeys-08-store-pkey-in-vma arch/x86/include/asm/mmu_context.h
--- a/arch/x86/include/asm/mmu_context.h~pkeys-08-store-pkey-in-vma	2015-12-03 16:21:22.505529763 -0800
+++ b/arch/x86/include/asm/mmu_context.h	2015-12-03 16:21:22.514530171 -0800
@@ -243,4 +243,24 @@ static inline void arch_unmap(struct mm_
 		mpx_notify_unmap(mm, vma, start, end);
 }
 
+static inline int vma_pkey(struct vm_area_struct *vma)
+{
+	u16 pkey = 0;
+#ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS
+	unsigned long vma_pkey_mask = VM_PKEY_BIT0 | VM_PKEY_BIT1 |
+				      VM_PKEY_BIT2 | VM_PKEY_BIT3;
+	/*
+	 * ffs is one-based, not zero-based, so bias back down by 1.
+	 */
+	int vm_pkey_shift = __builtin_ffsl(vma_pkey_mask) - 1;
+	/*
+	 * gcc generates better code if we do this rather than:
+	 * pkey = (flags & mask) >> shift
+	 */
+	pkey = (vma->vm_flags >> vm_pkey_shift) &
+	       (vma_pkey_mask >> vm_pkey_shift);
+#endif
+	return pkey;
+}
+
 #endif /* _ASM_X86_MMU_CONTEXT_H */
diff -puN arch/x86/include/asm/pgtable_types.h~pkeys-08-store-pkey-in-vma arch/x86/include/asm/pgtable_types.h
--- a/arch/x86/include/asm/pgtable_types.h~pkeys-08-store-pkey-in-vma	2015-12-03 16:21:22.507529853 -0800
+++ b/arch/x86/include/asm/pgtable_types.h	2015-12-03 16:21:22.514530171 -0800
@@ -111,7 +111,12 @@
 #define _KERNPG_TABLE	(_PAGE_PRESENT | _PAGE_RW | _PAGE_ACCESSED |	\
 			 _PAGE_DIRTY)
 
-/* Set of bits not changed in pte_modify */
+/*
+ * Set of bits not changed in pte_modify.  The pte's
+ * protection key is treated like _PAGE_RW, for
+ * instance, and is *not* included in this mask since
+ * pte_modify() does modify it.
+ */
 #define _PAGE_CHG_MASK	(PTE_PFN_MASK | _PAGE_PCD | _PAGE_PWT |		\
 			 _PAGE_SPECIAL | _PAGE_ACCESSED | _PAGE_DIRTY |	\
 			 _PAGE_SOFT_DIRTY)
@@ -227,7 +232,10 @@ enum page_cache_mode {
 /* Extracts the PFN from a (pte|pmd|pud|pgd)val_t of a 4KB page */
 #define PTE_PFN_MASK		((pteval_t)PHYSICAL_PAGE_MASK)
 
-/* Extracts the flags from a (pte|pmd|pud|pgd)val_t of a 4KB page */
+/*
+ *  Extracts the flags from a (pte|pmd|pud|pgd)val_t
+ *  This includes the protection key value.
+ */
 #define PTE_FLAGS_MASK		(~PTE_PFN_MASK)
 
 typedef struct pgprot { pgprotval_t pgprot; } pgprot_t;
diff -puN arch/x86/include/uapi/asm/mman.h~pkeys-08-store-pkey-in-vma arch/x86/include/uapi/asm/mman.h
--- a/arch/x86/include/uapi/asm/mman.h~pkeys-08-store-pkey-in-vma	2015-12-03 16:21:22.509529944 -0800
+++ b/arch/x86/include/uapi/asm/mman.h	2015-12-03 16:21:22.514530171 -0800
@@ -6,6 +6,22 @@
 #define MAP_HUGE_2MB    (21 << MAP_HUGE_SHIFT)
 #define MAP_HUGE_1GB    (30 << MAP_HUGE_SHIFT)
 
+#ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS
+/*
+ * Take the 4 protection key bits out of the vma->vm_flags
+ * value and turn them in to the bits that we can put in
+ * to a pte.
+ *
+ * Only override these if Protection Keys are available
+ * (which is only on 64-bit).
+ */
+#define arch_vm_get_page_prot(vm_flags)	__pgprot(	\
+		((vm_flags) & VM_PKEY_BIT0 ? _PAGE_PKEY_BIT0 : 0) |	\
+		((vm_flags) & VM_PKEY_BIT1 ? _PAGE_PKEY_BIT1 : 0) |	\
+		((vm_flags) & VM_PKEY_BIT2 ? _PAGE_PKEY_BIT2 : 0) |	\
+		((vm_flags) & VM_PKEY_BIT3 ? _PAGE_PKEY_BIT3 : 0))
+#endif
+
 #include <asm-generic/mman.h>
 
 #endif /* _ASM_X86_MMAN_H */
diff -puN include/linux/mm.h~pkeys-08-store-pkey-in-vma include/linux/mm.h
--- a/include/linux/mm.h~pkeys-08-store-pkey-in-vma	2015-12-03 16:21:22.510529990 -0800
+++ b/include/linux/mm.h	2015-12-03 16:21:22.515530216 -0800
@@ -167,6 +167,12 @@ extern unsigned int kobjsize(const void
 
 #if defined(CONFIG_X86)
 # define VM_PAT		VM_ARCH_1	/* PAT reserves whole VMA at once (x86) */
+#if defined (CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS)
+# define VM_PKEY_BIT0	VM_HIGH_ARCH_0	/* A protection key is a 4-bit value */
+# define VM_PKEY_BIT1	VM_HIGH_ARCH_1
+# define VM_PKEY_BIT2	VM_HIGH_ARCH_2
+# define VM_PKEY_BIT3	VM_HIGH_ARCH_3
+#endif
 #elif defined(CONFIG_PPC)
 # define VM_SAO		VM_ARCH_1	/* Strong Access Ordering (powerpc) */
 #elif defined(CONFIG_PARISC)
_

^ permalink raw reply	[flat|nested] 145+ messages in thread

* [PATCH 11/34] x86, pkeys: pass VMA down in to fault signal generation code
  2015-12-04  1:14 ` Dave Hansen
@ 2015-12-04  1:14   ` Dave Hansen
  0 siblings, 0 replies; 145+ messages in thread
From: Dave Hansen @ 2015-12-04  1:14 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-mm, x86, Dave Hansen, dave.hansen


From: Dave Hansen <dave.hansen@linux.intel.com>

During a page fault, we look up the VMA to ensure that the fault
is in a region with a valid mapping.  But, in the top-level page
fault code we don't need the VMA for much else.  Once we have
decided that an access is bad, we are going to send a signal no
matter what and do not need the VMA any more.  So we do not pass
it down into the signal generation code.

But, for protection keys, we need the VMA.  It tells us *which*
protection key we violated if we get a PF_PK.  So, we need to
pass the VMA down and fill in siginfo->si_pkey.
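
On the receiving end, user space would consume this roughly as in
the sketch below.  SEGV_PKUERR and si_pkey are introduced later in
this series, and portably reading si_pkey needs a libc that knows
the field, so this sketch only checks si_code:

	#include <signal.h>
	#include <string.h>
	#include <unistd.h>

	#ifndef SEGV_PKUERR
	#define SEGV_PKUERR 4	/* value proposed by this series */
	#endif

	static void segv_handler(int sig, siginfo_t *si, void *ctx)
	{
		(void)sig; (void)ctx;
		if (si->si_code == SEGV_PKUERR)
			write(2, "pkey fault\n", 11);
		_exit(1);
	}

	int main(void)
	{
		struct sigaction sa;

		memset(&sa, 0, sizeof(sa));
		sa.sa_sigaction = segv_handler;
		sa.sa_flags = SA_SIGINFO;
		sigaction(SIGSEGV, &sa, NULL);
		/* ... access pkey-protected memory here ... */
		return 0;
	}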

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
---

 b/arch/x86/mm/fault.c |   50 ++++++++++++++++++++++++++++----------------------
 1 file changed, 28 insertions(+), 22 deletions(-)

diff -puN arch/x86/mm/fault.c~pkeys-08-pass-down-vma arch/x86/mm/fault.c
--- a/arch/x86/mm/fault.c~pkeys-08-pass-down-vma	2015-12-03 16:21:22.995551986 -0800
+++ b/arch/x86/mm/fault.c	2015-12-03 16:21:22.998552122 -0800
@@ -171,7 +171,8 @@ is_prefetch(struct pt_regs *regs, unsign
 
 static void
 force_sig_info_fault(int si_signo, int si_code, unsigned long address,
-		     struct task_struct *tsk, int fault)
+		     struct task_struct *tsk, struct vm_area_struct *vma,
+		     int fault)
 {
 	unsigned lsb = 0;
 	siginfo_t info;
@@ -656,6 +657,8 @@ no_context(struct pt_regs *regs, unsigne
 	struct task_struct *tsk = current;
 	unsigned long flags;
 	int sig;
+	/* No context means no VMA to pass down */
+	struct vm_area_struct *vma = NULL;
 
 	/* Are we prepared to handle this kernel fault? */
 	if (fixup_exception(regs)) {
@@ -679,7 +682,8 @@ no_context(struct pt_regs *regs, unsigne
 			tsk->thread.cr2 = address;
 
 			/* XXX: hwpoison faults will set the wrong code. */
-			force_sig_info_fault(signal, si_code, address, tsk, 0);
+			force_sig_info_fault(signal, si_code, address,
+					     tsk, vma, 0);
 		}
 
 		/*
@@ -756,7 +760,8 @@ show_signal_msg(struct pt_regs *regs, un
 
 static void
 __bad_area_nosemaphore(struct pt_regs *regs, unsigned long error_code,
-		       unsigned long address, int si_code)
+		       unsigned long address, struct vm_area_struct *vma,
+		       int si_code)
 {
 	struct task_struct *tsk = current;
 
@@ -799,7 +804,7 @@ __bad_area_nosemaphore(struct pt_regs *r
 		tsk->thread.error_code	= error_code;
 		tsk->thread.trap_nr	= X86_TRAP_PF;
 
-		force_sig_info_fault(SIGSEGV, si_code, address, tsk, 0);
+		force_sig_info_fault(SIGSEGV, si_code, address, tsk, vma, 0);
 
 		return;
 	}
@@ -812,14 +817,14 @@ __bad_area_nosemaphore(struct pt_regs *r
 
 static noinline void
 bad_area_nosemaphore(struct pt_regs *regs, unsigned long error_code,
-		     unsigned long address)
+		     unsigned long address, struct vm_area_struct *vma)
 {
-	__bad_area_nosemaphore(regs, error_code, address, SEGV_MAPERR);
+	__bad_area_nosemaphore(regs, error_code, address, vma, SEGV_MAPERR);
 }
 
 static void
 __bad_area(struct pt_regs *regs, unsigned long error_code,
-	   unsigned long address, int si_code)
+	   unsigned long address,  struct vm_area_struct *vma, int si_code)
 {
 	struct mm_struct *mm = current->mm;
 
@@ -829,25 +834,25 @@ __bad_area(struct pt_regs *regs, unsigne
 	 */
 	up_read(&mm->mmap_sem);
 
-	__bad_area_nosemaphore(regs, error_code, address, si_code);
+	__bad_area_nosemaphore(regs, error_code, address, vma, si_code);
 }
 
 static noinline void
 bad_area(struct pt_regs *regs, unsigned long error_code, unsigned long address)
 {
-	__bad_area(regs, error_code, address, SEGV_MAPERR);
+	__bad_area(regs, error_code, address, NULL, SEGV_MAPERR);
 }
 
 static noinline void
 bad_area_access_error(struct pt_regs *regs, unsigned long error_code,
-		      unsigned long address)
+		      unsigned long address, struct vm_area_struct *vma)
 {
-	__bad_area(regs, error_code, address, SEGV_ACCERR);
+	__bad_area(regs, error_code, address, vma, SEGV_ACCERR);
 }
 
 static void
 do_sigbus(struct pt_regs *regs, unsigned long error_code, unsigned long address,
-	  unsigned int fault)
+	  struct vm_area_struct *vma, unsigned int fault)
 {
 	struct task_struct *tsk = current;
 	int code = BUS_ADRERR;
@@ -874,12 +879,13 @@ do_sigbus(struct pt_regs *regs, unsigned
 		code = BUS_MCEERR_AR;
 	}
 #endif
-	force_sig_info_fault(SIGBUS, code, address, tsk, fault);
+	force_sig_info_fault(SIGBUS, code, address, tsk, vma, fault);
 }
 
 static noinline void
 mm_fault_error(struct pt_regs *regs, unsigned long error_code,
-	       unsigned long address, unsigned int fault)
+	       unsigned long address, struct vm_area_struct *vma,
+	       unsigned int fault)
 {
 	if (fatal_signal_pending(current) && !(error_code & PF_USER)) {
 		no_context(regs, error_code, address, 0, 0);
@@ -903,9 +909,9 @@ mm_fault_error(struct pt_regs *regs, uns
 	} else {
 		if (fault & (VM_FAULT_SIGBUS|VM_FAULT_HWPOISON|
 			     VM_FAULT_HWPOISON_LARGE))
-			do_sigbus(regs, error_code, address, fault);
+			do_sigbus(regs, error_code, address, vma, fault);
 		else if (fault & VM_FAULT_SIGSEGV)
-			bad_area_nosemaphore(regs, error_code, address);
+			bad_area_nosemaphore(regs, error_code, address, vma);
 		else
 			BUG();
 	}
@@ -1119,7 +1125,7 @@ __do_page_fault(struct pt_regs *regs, un
 		 * Don't take the mm semaphore here. If we fixup a prefetch
 		 * fault we could otherwise deadlock:
 		 */
-		bad_area_nosemaphore(regs, error_code, address);
+		bad_area_nosemaphore(regs, error_code, address, NULL);
 
 		return;
 	}
@@ -1132,7 +1138,7 @@ __do_page_fault(struct pt_regs *regs, un
 		pgtable_bad(regs, error_code, address);
 
 	if (unlikely(smap_violation(error_code, regs))) {
-		bad_area_nosemaphore(regs, error_code, address);
+		bad_area_nosemaphore(regs, error_code, address, NULL);
 		return;
 	}
 
@@ -1141,7 +1147,7 @@ __do_page_fault(struct pt_regs *regs, un
 	 * in a region with pagefaults disabled then we must not take the fault
 	 */
 	if (unlikely(faulthandler_disabled() || !mm)) {
-		bad_area_nosemaphore(regs, error_code, address);
+		bad_area_nosemaphore(regs, error_code, address, NULL);
 		return;
 	}
 
@@ -1185,7 +1191,7 @@ __do_page_fault(struct pt_regs *regs, un
 	if (unlikely(!down_read_trylock(&mm->mmap_sem))) {
 		if ((error_code & PF_USER) == 0 &&
 		    !search_exception_tables(regs->ip)) {
-			bad_area_nosemaphore(regs, error_code, address);
+			bad_area_nosemaphore(regs, error_code, address, NULL);
 			return;
 		}
 retry:
@@ -1233,7 +1239,7 @@ retry:
 	 */
 good_area:
 	if (unlikely(access_error(error_code, vma))) {
-		bad_area_access_error(regs, error_code, address);
+		bad_area_access_error(regs, error_code, address, vma);
 		return;
 	}
 
@@ -1271,7 +1277,7 @@ good_area:
 
 	up_read(&mm->mmap_sem);
 	if (unlikely(fault & VM_FAULT_ERROR)) {
-		mm_fault_error(regs, error_code, address, fault);
+		mm_fault_error(regs, error_code, address, vma, fault);
 		return;
 	}
 
_

^ permalink raw reply	[flat|nested] 145+ messages in thread


* [PATCH 12/34] signals, pkeys: notify userspace about protection key faults
  2015-12-04  1:14 ` Dave Hansen
@ 2015-12-04  1:14   ` Dave Hansen
  -1 siblings, 0 replies; 145+ messages in thread
From: Dave Hansen @ 2015-12-04  1:14 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-mm, x86, Dave Hansen, dave.hansen


From: Dave Hansen <dave.hansen@linux.intel.com>

A protection key fault is very similar to any other access error.
There must be a VMA, etc...  We even want to take the same action
(SIGSEGV) that we do with a normal access fault.

However, we do need to let userspace know that something is
different.  We do this the same way we did with SEGV_BNDERR
with Memory Protection eXtensions (MPX): define a new SEGV code:
SEGV_PKUERR.

We add a siginfo field, si_pkey, that reveals to userspace which
protection key was set on the PTE that we faulted on.  There is
no other easy way for userspace to figure this out.  They could
parse smaps but that would be a bit cruel.

We share space in siginfo with _addr_bnd.  #BR faults from
MPX are completely separate from page faults (#PF) that trigger
from protection key violations, so we never need both at the same
time.

Note that _pkey is a 64-bit value.  The current hardware only
supports 4-bit protection keys.  We do this because there is
_plenty_ of space in _sigfault and it is possible that future
processors will support more than 4 bits of protection keys.

The x86 code to actually fill in the siginfo is in the next
patch.
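
(Illustration only, not part of this patch: a userspace SIGSEGV
handler might consume the new field roughly as below, assuming a
libc whose headers expose SEGV_PKUERR and si_pkey.  The handler is
hypothetical and would be installed with sigaction() and
SA_SIGINFO.)

	#include <signal.h>
	#include <stdio.h>

	/* sketch only: assumes libc exposes SEGV_PKUERR/si_pkey;
	 * fprintf() is not async-signal-safe, demo use only */
	static void segv_handler(int sig, siginfo_t *si, void *ctx)
	{
		if (si->si_code == SEGV_PKUERR)
			fprintf(stderr, "pkey %llu denied access at %p\n",
				(unsigned long long)si->si_pkey,
				si->si_addr);
	}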

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
---

 b/include/uapi/asm-generic/siginfo.h |   17 ++++++++++++-----
 b/kernel/signal.c                    |    4 ++++
 2 files changed, 16 insertions(+), 5 deletions(-)

diff -puN include/uapi/asm-generic/siginfo.h~pkeys-09-siginfo-core include/uapi/asm-generic/siginfo.h
--- a/include/uapi/asm-generic/siginfo.h~pkeys-09-siginfo-core	2015-12-03 16:21:23.412570898 -0800
+++ b/include/uapi/asm-generic/siginfo.h	2015-12-03 16:21:23.417571125 -0800
@@ -91,10 +91,15 @@ typedef struct siginfo {
 			int _trapno;	/* TRAP # which caused the signal */
 #endif
 			short _addr_lsb; /* LSB of the reported address */
-			struct {
-				void __user *_lower;
-				void __user *_upper;
-			} _addr_bnd;
+			union {
+				/* used when si_code=SEGV_BNDERR */
+				struct {
+					void __user *_lower;
+					void __user *_upper;
+				} _addr_bnd;
+				/* used when si_code=SEGV_PKUERR */
+				u64 _pkey;
+			};
 		} _sigfault;
 
 		/* SIGPOLL */
@@ -137,6 +142,7 @@ typedef struct siginfo {
 #define si_addr_lsb	_sifields._sigfault._addr_lsb
 #define si_lower	_sifields._sigfault._addr_bnd._lower
 #define si_upper	_sifields._sigfault._addr_bnd._upper
+#define si_pkey		_sifields._sigfault._pkey
 #define si_band		_sifields._sigpoll._band
 #define si_fd		_sifields._sigpoll._fd
 #ifdef __ARCH_SIGSYS
@@ -206,7 +212,8 @@ typedef struct siginfo {
 #define SEGV_MAPERR	(__SI_FAULT|1)	/* address not mapped to object */
 #define SEGV_ACCERR	(__SI_FAULT|2)	/* invalid permissions for mapped object */
 #define SEGV_BNDERR	(__SI_FAULT|3)  /* failed address bound checks */
-#define NSIGSEGV	3
+#define SEGV_PKUERR	(__SI_FAULT|4)  /* failed protection key checks */
+#define NSIGSEGV	4
 
 /*
  * SIGBUS si_codes
diff -puN kernel/signal.c~pkeys-09-siginfo-core kernel/signal.c
--- a/kernel/signal.c~pkeys-09-siginfo-core	2015-12-03 16:21:23.414570989 -0800
+++ b/kernel/signal.c	2015-12-03 16:21:23.418571170 -0800
@@ -2709,6 +2709,10 @@ int copy_siginfo_to_user(siginfo_t __use
 			err |= __put_user(from->si_upper, &to->si_upper);
 		}
 #endif
+#ifdef SEGV_PKUERR
+		if (from->si_signo == SIGSEGV && from->si_code == SEGV_PKUERR)
+			err |= __put_user(from->si_pkey, &to->si_pkey);
+#endif
 		break;
 	case __SI_CHLD:
 		err |= __put_user(from->si_pid, &to->si_pid);
_

^ permalink raw reply	[flat|nested] 145+ messages in thread


* [PATCH 13/34] x86, pkeys: fill in pkey field in siginfo
  2015-12-04  1:14 ` Dave Hansen
@ 2015-12-04  1:14   ` Dave Hansen
  -1 siblings, 0 replies; 145+ messages in thread
From: Dave Hansen @ 2015-12-04  1:14 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-mm, x86, Dave Hansen, dave.hansen


This fills in the new siginfo field, si_pkey, to indicate to
userspace which protection key was set on the PTE that we faulted
on.

Note though that *ALL* protection key faults have to be generated
by a valid, present PTE at some point.  But this code does no PTE
lookups, which seems odd.  The reason is that we take advantage of
the way we generate PTEs from VMAs.  All PTEs under a VMA share
some attributes.  For instance, they are _all_ either PROT_READ
*OR* PROT_NONE.  They also always share a protection key, so we
never have to walk the page tables; we just use the VMA.
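
(Illustration only: the actual lookup is the vma_pkey() helper
added earlier in the series.  The idea is roughly the sketch
below; the mask and shift names are placeholders, not the real
ones.)

	/*
	 * sketch: the pkey is encoded in the high vm_flags, so
	 * recovering it is a mask and a shift, with no page-table
	 * walk needed (flag names here are illustrative)
	 */
	static inline int vma_pkey_sketch(struct vm_area_struct *vma)
	{
		return (vma->vm_flags & VM_PKEY_MASK) >> VM_PKEY_SHIFT;
	}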

Note that _pkey is a 64-bit value.  The current hardware only
supports 4-bit protection keys.  We do this because there is
_plenty_ of space in _sigfault and it is possible that future
processors will support more than 4 bits of protection keys.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
---

 b/arch/x86/include/asm/pgtable_types.h |    5 ++
 b/arch/x86/mm/fault.c                  |   64 ++++++++++++++++++++++++++++++++-
 2 files changed, 68 insertions(+), 1 deletion(-)

diff -puN arch/x86/include/asm/pgtable_types.h~pkeys-09-siginfo-x86 arch/x86/include/asm/pgtable_types.h
--- a/arch/x86/include/asm/pgtable_types.h~pkeys-09-siginfo-x86	2015-12-03 16:21:23.853590899 -0800
+++ b/arch/x86/include/asm/pgtable_types.h	2015-12-03 16:21:23.858591126 -0800
@@ -64,6 +64,11 @@
 #endif
 #define __HAVE_ARCH_PTE_SPECIAL
 
+#define _PAGE_PKEY_MASK (_PAGE_PKEY_BIT0 | \
+			 _PAGE_PKEY_BIT1 | \
+			 _PAGE_PKEY_BIT2 | \
+			 _PAGE_PKEY_BIT3)
+
 #ifdef CONFIG_KMEMCHECK
 #define _PAGE_HIDDEN	(_AT(pteval_t, 1) << _PAGE_BIT_HIDDEN)
 #else
diff -puN arch/x86/mm/fault.c~pkeys-09-siginfo-x86 arch/x86/mm/fault.c
--- a/arch/x86/mm/fault.c~pkeys-09-siginfo-x86	2015-12-03 16:21:23.854590944 -0800
+++ b/arch/x86/mm/fault.c	2015-12-03 16:21:23.859591171 -0800
@@ -15,12 +15,14 @@
 #include <linux/context_tracking.h>	/* exception_enter(), ...	*/
 #include <linux/uaccess.h>		/* faulthandler_disabled()	*/
 
+#include <asm/cpufeature.h>		/* boot_cpu_has, ...		*/
 #include <asm/traps.h>			/* dotraplinkage, ...		*/
 #include <asm/pgalloc.h>		/* pgd_*(), ...			*/
 #include <asm/kmemcheck.h>		/* kmemcheck_*(), ...		*/
 #include <asm/fixmap.h>			/* VSYSCALL_ADDR		*/
 #include <asm/vsyscall.h>		/* emulate_vsyscall		*/
 #include <asm/vm86.h>			/* struct vm86			*/
+#include <asm/mmu_context.h>		/* vma_pkey()			*/
 
 #define CREATE_TRACE_POINTS
 #include <asm/trace/exceptions.h>
@@ -169,6 +171,56 @@ is_prefetch(struct pt_regs *regs, unsign
 	return prefetch;
 }
 
+/*
+ * A protection key fault means that the PKRU value did not allow
+ * access to some PTE.  Userspace can figure out what PKRU was
+ * from the XSAVE state, and this function fills out a field in
+ * siginfo so userspace can discover which protection key was set
+ * on the PTE.
+ *
+ * If we get here, we know that the hardware signaled a PF_PK
+ * fault and that there was a VMA once we got in the fault
+ * handler.  It does *not* guarantee that the VMA we find here
+ * was the one that we faulted on.
+ *
+ * 1. T1   : mprotect_key(foo, PAGE_SIZE, pkey=4);
+ * 2. T1   : set PKRU to deny access to pkey=4, touches page
+ * 3. T1   : faults...
+ * 4.    T2: mprotect_key(foo, PAGE_SIZE, pkey=5);
+ * 5. T1   : enters fault handler, takes mmap_sem, etc...
+ * 6. T1   : reaches here, sees vma_pkey(vma)=5, when we really
+ *	     faulted on a pte with its pkey=4.
+ */
+static void fill_sig_info_pkey(int si_code, siginfo_t *info,
+		struct vm_area_struct *vma)
+{
+	/* This is effectively an #ifdef */
+	if (!boot_cpu_has(X86_FEATURE_OSPKE))
+		return;
+
+	/* Fault not from Protection Keys: nothing to do */
+	if (si_code != SEGV_PKUERR)
+		return;
+	/*
+	 * force_sig_info_fault() is called from a number of
+	 * contexts, some of which have a VMA and some of which
+	 * do not.  The PF_PK handling happens after we have a
+	 * valid VMA, so we should never reach this without a
+	 * valid VMA.
+	 */
+	if (!vma) {
+		WARN_ONCE(1, "PKU fault with no VMA passed in");
+		info->si_pkey = 0;
+		return;
+	}
+	/*
+	 * si_pkey should be thought of as a strong hint, but not
+	 * absolutely guaranteed to be 100% accurate because of
+	 * the race explained above.
+	 */
+	info->si_pkey = vma_pkey(vma);
+}
+
 static void
 force_sig_info_fault(int si_signo, int si_code, unsigned long address,
 		     struct task_struct *tsk, struct vm_area_struct *vma,
@@ -187,6 +239,8 @@ force_sig_info_fault(int si_signo, int s
 		lsb = PAGE_SHIFT;
 	info.si_addr_lsb = lsb;
 
+	fill_sig_info_pkey(si_code, &info, vma);
+
 	force_sig_info(si_signo, &info, tsk);
 }
 
@@ -847,7 +901,15 @@ static noinline void
 bad_area_access_error(struct pt_regs *regs, unsigned long error_code,
 		      unsigned long address, struct vm_area_struct *vma)
 {
-	__bad_area(regs, error_code, address, vma, SEGV_ACCERR);
+	/*
+	 * This OSPKE check is not strictly necessary at runtime.
+	 * But, doing it this way allows compiler optimizations
+	 * if pkeys are compiled out.
+	 */
+	if (boot_cpu_has(X86_FEATURE_OSPKE) && (error_code & PF_PK))
+		__bad_area(regs, error_code, address, vma, SEGV_PKUERR);
+	else
+		__bad_area(regs, error_code, address, vma, SEGV_ACCERR);
 }
 
 static void
_

^ permalink raw reply	[flat|nested] 145+ messages in thread


* [PATCH 14/34] x86, pkeys: add functions to fetch PKRU
  2015-12-04  1:14 ` Dave Hansen
@ 2015-12-04  1:14   ` Dave Hansen
  -1 siblings, 0 replies; 145+ messages in thread
From: Dave Hansen @ 2015-12-04  1:14 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-mm, x86, Dave Hansen, dave.hansen


From: Dave Hansen <dave.hansen@linux.intel.com>

This adds the raw instruction to access PKRU as well as some
accessor functions that correctly handle when the CPU does not
support the instruction.  We don't use it here, but we will use
read_pkru() in the next patch.
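
(For orientation, not part of this patch: PKRU packs two bits per
key, access-disable then write-disable, so key N owns bits 2N and
2N+1.  A later patch in the series encodes these as PKRU_AD_BIT
and PKRU_WD_BIT; a minimal sketch of a check against that layout:)

	/* sketch: key N owns PKRU bits 2N (AD) and 2N+1 (WD) */
	static inline int pkru_denies(u32 pkru, int pkey, int write)
	{
		u32 bits = pkru >> (2 * pkey);

		if (bits & 0x1)			/* access-disable */
			return 1;
		return write && (bits & 0x2);	/* write-disable */
	}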

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
---

 b/arch/x86/include/asm/pgtable.h       |    8 ++++++++
 b/arch/x86/include/asm/special_insns.h |   22 ++++++++++++++++++++++
 2 files changed, 30 insertions(+)

diff -puN arch/x86/include/asm/pgtable.h~pkeys-13-kernel-pkru-instructions arch/x86/include/asm/pgtable.h
--- a/arch/x86/include/asm/pgtable.h~pkeys-13-kernel-pkru-instructions	2015-12-03 16:21:24.298611081 -0800
+++ b/arch/x86/include/asm/pgtable.h	2015-12-03 16:21:24.303611308 -0800
@@ -102,6 +102,14 @@ static inline int pte_dirty(pte_t pte)
 	return pte_flags(pte) & _PAGE_DIRTY;
 }
 
+
+static inline u32 read_pkru(void)
+{
+	if (boot_cpu_has(X86_FEATURE_OSPKE))
+		return __read_pkru();
+	return 0;
+}
+
 static inline int pte_young(pte_t pte)
 {
 	return pte_flags(pte) & _PAGE_ACCESSED;
diff -puN arch/x86/include/asm/special_insns.h~pkeys-13-kernel-pkru-instructions arch/x86/include/asm/special_insns.h
--- a/arch/x86/include/asm/special_insns.h~pkeys-13-kernel-pkru-instructions	2015-12-03 16:21:24.300611172 -0800
+++ b/arch/x86/include/asm/special_insns.h	2015-12-03 16:21:24.303611308 -0800
@@ -98,6 +98,28 @@ static inline void native_write_cr8(unsi
 }
 #endif
 
+#ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS
+static inline u32 __read_pkru(void)
+{
+	unsigned int ecx = 0;
+	unsigned int edx, pkru;
+
+	/*
+	 * "rdpkru" instruction.  Places PKRU contents into EAX,
+	 * clears EDX and requires that ecx=0.
+	 */
+	asm volatile(".byte 0x0f,0x01,0xee\n\t"
+		     : "=a" (pkru), "=d" (edx)
+		     : "c" (ecx));
+	return pkru;
+}
+#else
+static inline u32 __read_pkru(void)
+{
+	return 0;
+}
+#endif
+
 static inline void native_wbinvd(void)
 {
 	asm volatile("wbinvd": : :"memory");
_

^ permalink raw reply	[flat|nested] 145+ messages in thread


* [PATCH 15/34] mm: factor out VMA fault permission checking
  2015-12-04  1:14 ` Dave Hansen
@ 2015-12-04  1:14   ` Dave Hansen
  -1 siblings, 0 replies; 145+ messages in thread
From: Dave Hansen @ 2015-12-04  1:14 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-mm, x86, Dave Hansen, dave.hansen


From: Dave Hansen <dave.hansen@linux.intel.com>

This code matches a fault condition up with the VMA and ensures
that the VMA allows the fault to be handled instead of just
erroring out.

We will be extending this in a moment to comprehend protection
keys.
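
(A sketch of the resulting call pattern: a caller asks whether,
say, a write fault may be handled against a given VMA and bails
out with -EFAULT if not, as the fixup_user_fault() hunk below
does with its fault_flags.)

	if (!vma_permits_fault(vma, FAULT_FLAG_WRITE))
		return -EFAULT;	/* VMA does not allow a write fault */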

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
---

 b/mm/gup.c |   16 +++++++++++++---
 1 file changed, 13 insertions(+), 3 deletions(-)

diff -puN mm/gup.c~pkeys-10-pte-fault mm/gup.c
--- a/mm/gup.c~pkeys-10-pte-fault	2015-12-03 16:21:24.737630991 -0800
+++ b/mm/gup.c	2015-12-03 16:21:24.741631172 -0800
@@ -557,6 +557,18 @@ next_page:
 }
 EXPORT_SYMBOL(__get_user_pages);
 
+bool vma_permits_fault(struct vm_area_struct *vma, unsigned int fault_flags)
+{
+	vm_flags_t vm_flags;
+
+	vm_flags = (fault_flags & FAULT_FLAG_WRITE) ? VM_WRITE : VM_READ;
+
+	if (!(vm_flags & vma->vm_flags))
+		return false;
+
+	return true;
+}
+
 /*
  * fixup_user_fault() - manually resolve a user page fault
  * @tsk:	the task_struct to use for page fault accounting, or
@@ -588,15 +600,13 @@ int fixup_user_fault(struct task_struct
 		     unsigned long address, unsigned int fault_flags)
 {
 	struct vm_area_struct *vma;
-	vm_flags_t vm_flags;
 	int ret;
 
 	vma = find_extend_vma(mm, address);
 	if (!vma || address < vma->vm_start)
 		return -EFAULT;
 
-	vm_flags = (fault_flags & FAULT_FLAG_WRITE) ? VM_WRITE : VM_READ;
-	if (!(vm_flags & vma->vm_flags))
+	if (!vma_permits_fault(vma, fault_flags))
 		return -EFAULT;
 
 	ret = handle_mm_fault(mm, vma, address, fault_flags);
_

^ permalink raw reply	[flat|nested] 145+ messages in thread


* [PATCH 16/34] x86, mm: simplify get_user_pages() PTE bit handling
  2015-12-04  1:14 ` Dave Hansen
@ 2015-12-04  1:14   ` Dave Hansen
  -1 siblings, 0 replies; 145+ messages in thread
From: Dave Hansen @ 2015-12-04  1:14 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-mm, x86, Dave Hansen, dave.hansen


From: Dave Hansen <dave.hansen@linux.intel.com>

The current get_user_pages() code is a wee bit more complicated
than it needs to be for pte bit checking.  Currently, it establishes
a mask of required pte _PAGE_* bits and ensures that the pte it
goes after has all those bits.

This consolidates the three identical copies of this code.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
---

 b/arch/x86/mm/gup.c |   45 ++++++++++++++++++++++++++++-----------------
 1 file changed, 28 insertions(+), 17 deletions(-)

diff -puN arch/x86/mm/gup.c~pkeys-16-gup-swizzle arch/x86/mm/gup.c
--- a/arch/x86/mm/gup.c~pkeys-16-gup-swizzle	2015-12-03 16:21:25.148649631 -0800
+++ b/arch/x86/mm/gup.c	2015-12-03 16:21:25.151649767 -0800
@@ -63,6 +63,30 @@ retry:
 #endif
 }
 
+static inline int pte_allows_gup(pte_t pte, int write)
+{
+	/*
+	 * 'pte' can really be a pte, pmd or pud.  We only check
+	 * _PAGE_PRESENT, _PAGE_USER, and _PAGE_RW in here which
+	 * are the same value on all 3 types.
+	 */
+	if (!(pte_flags(pte) & (_PAGE_PRESENT|_PAGE_USER)))
+		return 0;
+	if (write && !(pte_write(pte)))
+		return 0;
+	return 1;
+}
+
+static inline int pmd_allows_gup(pmd_t pmd, int write)
+{
+	return pte_allows_gup(*(pte_t *)&pmd, write);
+}
+
+static inline int pud_allows_gup(pud_t pud, int write)
+{
+	return pte_allows_gup(*(pte_t *)&pud, write);
+}
+
 /*
  * The performance critical leaf functions are made noinline otherwise gcc
  * inlines everything into a single function which results in too much
@@ -71,13 +95,8 @@ retry:
 static noinline int gup_pte_range(pmd_t pmd, unsigned long addr,
 		unsigned long end, int write, struct page **pages, int *nr)
 {
-	unsigned long mask;
 	pte_t *ptep;
 
-	mask = _PAGE_PRESENT|_PAGE_USER;
-	if (write)
-		mask |= _PAGE_RW;
-
 	ptep = pte_offset_map(&pmd, addr);
 	do {
 		pte_t pte = gup_get_pte(ptep);
@@ -88,8 +107,8 @@ static noinline int gup_pte_range(pmd_t
 			pte_unmap(ptep);
 			return 0;
 		}
-
-		if ((pte_flags(pte) & (mask | _PAGE_SPECIAL)) != mask) {
+		if (!pte_allows_gup(pte, write) ||
+		    pte_special(pte)) {
 			pte_unmap(ptep);
 			return 0;
 		}
@@ -117,14 +136,10 @@ static inline void get_head_page_multipl
 static noinline int gup_huge_pmd(pmd_t pmd, unsigned long addr,
 		unsigned long end, int write, struct page **pages, int *nr)
 {
-	unsigned long mask;
 	struct page *head, *page;
 	int refs;
 
-	mask = _PAGE_PRESENT|_PAGE_USER;
-	if (write)
-		mask |= _PAGE_RW;
-	if ((pmd_flags(pmd) & mask) != mask)
+	if (!pmd_allows_gup(pmd, write))
 		return 0;
 	/* hugepages are never "special" */
 	VM_BUG_ON(pmd_flags(pmd) & _PAGE_SPECIAL);
@@ -193,14 +208,10 @@ static int gup_pmd_range(pud_t pud, unsi
 static noinline int gup_huge_pud(pud_t pud, unsigned long addr,
 		unsigned long end, int write, struct page **pages, int *nr)
 {
-	unsigned long mask;
 	struct page *head, *page;
 	int refs;
 
-	mask = _PAGE_PRESENT|_PAGE_USER;
-	if (write)
-		mask |= _PAGE_RW;
-	if ((pud_flags(pud) & mask) != mask)
+	if (!pud_allows_gup(pud, write))
 		return 0;
 	/* hugepages are never "special" */
 	VM_BUG_ON(pud_flags(pud) & _PAGE_SPECIAL);
_

^ permalink raw reply	[flat|nested] 145+ messages in thread


* [PATCH 17/34] x86, pkeys: check VMAs and PTEs for protection keys
  2015-12-04  1:14 ` Dave Hansen
@ 2015-12-04  1:14   ` Dave Hansen
  -1 siblings, 0 replies; 145+ messages in thread
From: Dave Hansen @ 2015-12-04  1:14 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-mm, x86, Dave Hansen, dave.hansen


From: Dave Hansen <dave.hansen@linux.intel.com>

Today, for normal faults and page table walks, we check the VMA
and/or PTE to ensure that it is compatible with the action.  For
instance, if we get a write fault on a non-writeable VMA, we
SIGSEGV.

We try to do the same thing for protection keys.  Basically, we
try to make sure that if a user does this:

	mprotect(ptr, size, PROT_NONE);
	*ptr = foo;

they see the same effects with protection keys when they do this:

	mprotect(ptr, size, PROT_READ|PROT_WRITE);
	set_pkey(ptr, size, 4);
	wrpkru(0xffffff3f); // access disable pkey 4
	*ptr = foo;

The state to do that checking is in the VMA, but we also
sometimes have to do it on the page tables only, like when doing
a get_user_pages_fast() where we have no VMA.

We add two functions and expose them to generic code:

	arch_pte_access_permitted(pte, write)
	arch_vma_access_permitted(vma, write)

These are, of course, backed up in x86 arch code with checks
against the PTE or VMA's protection key.

But, there are also cases where we do not want to respect
protection keys.  When we ptrace(), for instance, we do not want
to apply the tracer's PKRU permissions to the PTEs from the
process being traced.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
---

 b/arch/powerpc/include/asm/mmu_context.h   |   11 ++++++
 b/arch/s390/include/asm/mmu_context.h      |   11 ++++++
 b/arch/unicore32/include/asm/mmu_context.h |   11 ++++++
 b/arch/x86/include/asm/mmu_context.h       |   49 +++++++++++++++++++++++++++++
 b/arch/x86/include/asm/pgtable.h           |   29 +++++++++++++++++
 b/arch/x86/mm/fault.c                      |   21 +++++++++++-
 b/arch/x86/mm/gup.c                        |    4 ++
 b/include/asm-generic/mm_hooks.h           |   11 ++++++
 b/mm/gup.c                                 |   18 ++++++++--
 b/mm/memory.c                              |    4 ++
 10 files changed, 165 insertions(+), 4 deletions(-)

diff -puN arch/powerpc/include/asm/mmu_context.h~pkeys-11-pte-fault arch/powerpc/include/asm/mmu_context.h
--- a/arch/powerpc/include/asm/mmu_context.h~pkeys-11-pte-fault	2015-12-03 16:21:25.567668634 -0800
+++ b/arch/powerpc/include/asm/mmu_context.h	2015-12-03 16:21:25.585669451 -0800
@@ -148,5 +148,16 @@ static inline void arch_bprm_mm_init(str
 {
 }
 
+static inline bool arch_vma_access_permitted(struct vm_area_struct *vma, bool write)
+{
+	/* by default, allow everything */
+	return true;
+}
+
+static inline bool arch_pte_access_permitted(pte_t pte, bool write)
+{
+	/* by default, allow everything */
+	return true;
+}
 #endif /* __KERNEL__ */
 #endif /* __ASM_POWERPC_MMU_CONTEXT_H */
diff -puN arch/s390/include/asm/mmu_context.h~pkeys-11-pte-fault arch/s390/include/asm/mmu_context.h
--- a/arch/s390/include/asm/mmu_context.h~pkeys-11-pte-fault	2015-12-03 16:21:25.569668725 -0800
+++ b/arch/s390/include/asm/mmu_context.h	2015-12-03 16:21:25.586669496 -0800
@@ -130,4 +130,15 @@ static inline void arch_bprm_mm_init(str
 {
 }
 
+static inline bool arch_vma_access_permitted(struct vm_area_struct *vma, bool write)
+{
+	/* by default, allow everything */
+	return true;
+}
+
+static inline bool arch_pte_access_permitted(pte_t pte, bool write)
+{
+	/* by default, allow everything */
+	return true;
+}
 #endif /* __S390_MMU_CONTEXT_H */
diff -puN arch/unicore32/include/asm/mmu_context.h~pkeys-11-pte-fault arch/unicore32/include/asm/mmu_context.h
--- a/arch/unicore32/include/asm/mmu_context.h~pkeys-11-pte-fault	2015-12-03 16:21:25.570668770 -0800
+++ b/arch/unicore32/include/asm/mmu_context.h	2015-12-03 16:21:25.586669496 -0800
@@ -97,4 +97,15 @@ static inline void arch_bprm_mm_init(str
 {
 }
 
+static inline bool arch_vma_access_permitted(struct vm_area_struct *vma, bool write)
+{
+	/* by default, allow everything */
+	return true;
+}
+
+static inline bool arch_pte_access_permitted(pte_t pte, bool write)
+{
+	/* by default, allow everything */
+	return true;
+}
 #endif
diff -puN arch/x86/include/asm/mmu_context.h~pkeys-11-pte-fault arch/x86/include/asm/mmu_context.h
--- a/arch/x86/include/asm/mmu_context.h~pkeys-11-pte-fault	2015-12-03 16:21:25.572668861 -0800
+++ b/arch/x86/include/asm/mmu_context.h	2015-12-03 16:21:25.586669496 -0800
@@ -263,4 +263,53 @@ static inline int vma_pkey(struct vm_are
 	return pkey;
 }
 
+static inline bool __pkru_allows_pkey(u16 pkey, bool write)
+{
+	u32 pkru = read_pkru();
+
+	if (!__pkru_allows_read(pkru, pkey))
+		return false;
+	if (write && !__pkru_allows_write(pkru, pkey))
+		return false;
+
+	return true;
+}
+
+/*
+ * We only want to enforce protection keys on the current process
+ * because we effectively have no access to PKRU for other
+ * processes or any way to tell *which* PKRU in a threaded
+ * process we could use.
+ *
+ * So do not enforce things if the VMA is not from the current
+ * mm, or if we are in a kernel thread.
+ */
+static inline bool vma_is_foreign(struct vm_area_struct *vma)
+{
+	if (!current->mm)
+		return true;
+	/*
+	 * Should PKRU be enforced on the access to this VMA?  If
+	 * the VMA is from another process, then PKRU has no
+	 * relevance and should not be enforced.
+	 */
+	if (current->mm != vma->vm_mm)
+		return true;
+
+	return false;
+}
+
+static inline bool arch_vma_access_permitted(struct vm_area_struct *vma, bool write)
+{
+	/* allow access if the VMA is not one from this process */
+	if (vma_is_foreign(vma))
+		return true;
+	return __pkru_allows_pkey(vma_pkey(vma), write);
+}
+
+static inline bool arch_pte_access_permitted(pte_t pte, bool write)
+{
+	return __pkru_allows_pkey(pte_flags_pkey(pte_flags(pte)), write);
+}
+
 #endif /* _ASM_X86_MMU_CONTEXT_H */
diff -puN arch/x86/include/asm/pgtable.h~pkeys-11-pte-fault arch/x86/include/asm/pgtable.h
--- a/arch/x86/include/asm/pgtable.h~pkeys-11-pte-fault	2015-12-03 16:21:25.574668952 -0800
+++ b/arch/x86/include/asm/pgtable.h	2015-12-03 16:21:25.587669541 -0800
@@ -910,6 +910,35 @@ static inline pte_t pte_swp_clear_soft_d
 }
 #endif
 
+#define PKRU_AD_BIT 0x1
+#define PKRU_WD_BIT 0x2
+
+static inline bool __pkru_allows_read(u32 pkru, u16 pkey)
+{
+	int pkru_pkey_bits = pkey * 2;
+	return !(pkru & (PKRU_AD_BIT << pkru_pkey_bits));
+}
+
+static inline bool __pkru_allows_write(u32 pkru, u16 pkey)
+{
+	int pkru_pkey_bits = pkey * 2;
+	/*
+	 * Access-disable disables writes too so we need to check
+	 * both bits here.
+	 */
+	return !(pkru & ((PKRU_AD_BIT|PKRU_WD_BIT) << pkru_pkey_bits));
+}
+
+static inline u16 pte_flags_pkey(unsigned long pte_flags)
+{
+#ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS
* [PATCH 17/34] x86, pkeys: check VMAs and PTEs for protection keys
@ 2015-12-04  1:14   ` Dave Hansen
  0 siblings, 0 replies; 145+ messages in thread
From: Dave Hansen @ 2015-12-04  1:14 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-mm, x86, Dave Hansen, dave.hansen


From: Dave Hansen <dave.hansen@linux.intel.com>

Today, for normal faults and page table walks, we check the VMA
and/or PTE to ensure that it is compatible with the action.  For
instance, if we get a write fault on a non-writeable VMA, we
SIGSEGV.

We try to do the same thing for protection keys.  Basically, we
try to make sure that if a user does this:

	mprotect(ptr, size, PROT_NONE);
	*ptr = foo;

they see the same effects with protection keys when they do this:

	mprotect(ptr, size, PROT_READ|PROT_WRITE);
	set_pkey(ptr, size, 4);
	wrpkru(0xffffff3f); // access disable pkey 4
	*ptr = foo;

The state to do that checking is in the VMA, but we also
sometimes have to do it on the page tables only, like when doing
a get_user_pages_fast() where we have no VMA.

We add two functions and expose them to generic code:

	arch_pte_access_permitted(pte, write)
	arch_vma_access_permitted(vma, write)

These are, of course, backed up in x86 arch code with checks
against the PTE or VMA's protection key.

But, there are also cases where we do not want to respect
protection keys.  When we ptrace(), for instance, we do not want
to apply the tracer's PKRU permissions to the PTEs from the
process being traced.
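
To make the second sequence above concrete, here is a minimal
sketch (illustration only, not part of this patch).  set_pkey() is
still a stand-in for the management system call added later in this
series, and wrpkru() is open-coded since compilers do not yet emit
the instruction.  PKRU holds two bits per key: AD at bit 2*key and
WD at bit 2*key+1:

	static inline void wrpkru(unsigned int pkru)
	{
		/* WRPKRU: EAX = new PKRU value, ECX and EDX must be 0 */
		asm volatile(".byte 0x0f,0x01,0xef"
			     : : "a" (pkru), "c" (0), "d" (0) : "memory");
	}

	mprotect(ptr, size, PROT_READ|PROT_WRITE);
	set_pkey(ptr, size, 4);		/* tag the range with pkey 4 */
	wrpkru(1U << (4 * 2));		/* set AD for key 4, leave others enabled */
	*ptr = foo;			/* faults, delivering SEGV_PKUERR */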

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
---

 b/arch/powerpc/include/asm/mmu_context.h   |   11 ++++++
 b/arch/s390/include/asm/mmu_context.h      |   11 ++++++
 b/arch/unicore32/include/asm/mmu_context.h |   11 ++++++
 b/arch/x86/include/asm/mmu_context.h       |   49 +++++++++++++++++++++++++++++
 b/arch/x86/include/asm/pgtable.h           |   29 +++++++++++++++++
 b/arch/x86/mm/fault.c                      |   21 +++++++++++-
 b/arch/x86/mm/gup.c                        |    4 ++
 b/include/asm-generic/mm_hooks.h           |   11 ++++++
 b/mm/gup.c                                 |   18 ++++++++--
 b/mm/memory.c                              |    4 ++
 10 files changed, 165 insertions(+), 4 deletions(-)

diff -puN arch/powerpc/include/asm/mmu_context.h~pkeys-11-pte-fault arch/powerpc/include/asm/mmu_context.h
--- a/arch/powerpc/include/asm/mmu_context.h~pkeys-11-pte-fault	2015-12-03 16:21:25.567668634 -0800
+++ b/arch/powerpc/include/asm/mmu_context.h	2015-12-03 16:21:25.585669451 -0800
@@ -148,5 +148,16 @@ static inline void arch_bprm_mm_init(str
 {
 }
 
+static inline bool arch_vma_access_permitted(struct vm_area_struct *vma, bool write)
+{
+	/* by default, allow everything */
+	return true;
+}
+
+static inline bool arch_pte_access_permitted(pte_t pte, bool write)
+{
+	/* by default, allow everything */
+	return true;
+}
 #endif /* __KERNEL__ */
 #endif /* __ASM_POWERPC_MMU_CONTEXT_H */
diff -puN arch/s390/include/asm/mmu_context.h~pkeys-11-pte-fault arch/s390/include/asm/mmu_context.h
--- a/arch/s390/include/asm/mmu_context.h~pkeys-11-pte-fault	2015-12-03 16:21:25.569668725 -0800
+++ b/arch/s390/include/asm/mmu_context.h	2015-12-03 16:21:25.586669496 -0800
@@ -130,4 +130,15 @@ static inline void arch_bprm_mm_init(str
 {
 }
 
+static inline bool arch_vma_access_permitted(struct vm_area_struct *vma, bool write)
+{
+	/* by default, allow everything */
+	return true;
+}
+
+static inline bool arch_pte_access_permitted(pte_t pte, bool write)
+{
+	/* by default, allow everything */
+	return true;
+}
 #endif /* __S390_MMU_CONTEXT_H */
diff -puN arch/unicore32/include/asm/mmu_context.h~pkeys-11-pte-fault arch/unicore32/include/asm/mmu_context.h
--- a/arch/unicore32/include/asm/mmu_context.h~pkeys-11-pte-fault	2015-12-03 16:21:25.570668770 -0800
+++ b/arch/unicore32/include/asm/mmu_context.h	2015-12-03 16:21:25.586669496 -0800
@@ -97,4 +97,15 @@ static inline void arch_bprm_mm_init(str
 {
 }
 
+static inline bool arch_vma_access_permitted(struct vm_area_struct *vma, bool write)
+{
+	/* by default, allow everything */
+	return true;
+}
+
+static inline bool arch_pte_access_permitted(pte_t pte, bool write)
+{
+	/* by default, allow everything */
+	return true;
+}
 #endif
diff -puN arch/x86/include/asm/mmu_context.h~pkeys-11-pte-fault arch/x86/include/asm/mmu_context.h
--- a/arch/x86/include/asm/mmu_context.h~pkeys-11-pte-fault	2015-12-03 16:21:25.572668861 -0800
+++ b/arch/x86/include/asm/mmu_context.h	2015-12-03 16:21:25.586669496 -0800
@@ -263,4 +263,53 @@ static inline int vma_pkey(struct vm_are
 	return pkey;
 }
 
+static inline bool __pkru_allows_pkey(u16 pkey, bool write)
+{
+	u32 pkru = read_pkru();
+
+	if (!__pkru_allows_read(pkru, pkey))
+		return false;
+	if (write && !__pkru_allows_write(pkru, pkey))
+		return false;
+
+	return true;
+}
+
+/*
+ * We only want to enforce protection keys on the current process
+ * because we effectively have no access to PKRU for other
+ * processes or any way to tell *which* PKRU in a threaded
+ * process we could use.
+ *
+ * So do not enforce things if the VMA is not from the current
+ * mm, or if we are in a kernel thread.
+ */
+static inline bool vma_is_foreign(struct vm_area_struct *vma)
+{
+	if (!current->mm)
+		return true;
+	/*
+	 * Should PKRU be enforced on the access to this VMA?  If
+	 * the VMA is from another process, then PKRU has no
+	 * relevance and should not be enforced.
+	 */
+	if (current->mm != vma->vm_mm)
+		return true;
+
+	return false;
+}
+
+static inline bool arch_vma_access_permitted(struct vm_area_struct *vma, bool write)
+{
+	/* allow access if the VMA is not one from this process */
+	if (vma_is_foreign(vma))
+		return true;
+	return __pkru_allows_pkey(vma_pkey(vma), write);
+}
+
+static inline bool arch_pte_access_permitted(pte_t pte, bool write)
+{
+	return __pkru_allows_pkey(pte_flags_pkey(pte_flags(pte)), write);
+}
+
 #endif /* _ASM_X86_MMU_CONTEXT_H */
diff -puN arch/x86/include/asm/pgtable.h~pkeys-11-pte-fault arch/x86/include/asm/pgtable.h
--- a/arch/x86/include/asm/pgtable.h~pkeys-11-pte-fault	2015-12-03 16:21:25.574668952 -0800
+++ b/arch/x86/include/asm/pgtable.h	2015-12-03 16:21:25.587669541 -0800
@@ -910,6 +910,35 @@ static inline pte_t pte_swp_clear_soft_d
 }
 #endif
 
+#define PKRU_AD_BIT 0x1
+#define PKRU_WD_BIT 0x2
+
+static inline bool __pkru_allows_read(u32 pkru, u16 pkey)
+{
+	int pkru_pkey_bits = pkey * 2;
+	return !(pkru & (PKRU_AD_BIT << pkru_pkey_bits));
+}
+
+static inline bool __pkru_allows_write(u32 pkru, u16 pkey)
+{
+	int pkru_pkey_bits = pkey * 2;
+	/*
+	 * Access-disable disables writes too so we need to check
+	 * both bits here.
+	 */
+	return !(pkru & ((PKRU_AD_BIT|PKRU_WD_BIT) << pkru_pkey_bits));
+}
+
+static inline u16 pte_flags_pkey(unsigned long pte_flags)
+{
+#ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS
+	/* ifdef to avoid doing 59-bit shift on 32-bit values */
+	return (pte_flags & _PAGE_PKEY_MASK) >> _PAGE_BIT_PKEY_BIT0;
+#else
+	return 0;
+#endif
+}
+
 #include <asm-generic/pgtable.h>
 #endif	/* __ASSEMBLY__ */
 
diff -puN arch/x86/mm/fault.c~pkeys-11-pte-fault arch/x86/mm/fault.c
--- a/arch/x86/mm/fault.c~pkeys-11-pte-fault	2015-12-03 16:21:25.575668997 -0800
+++ b/arch/x86/mm/fault.c	2015-12-03 16:21:25.587669541 -0800
@@ -897,6 +897,16 @@ bad_area(struct pt_regs *regs, unsigned
 	__bad_area(regs, error_code, address, NULL, SEGV_MAPERR);
 }
 
+static inline bool bad_area_access_from_pkeys(unsigned long error_code,
+		struct vm_area_struct *vma)
+{
+	if (!boot_cpu_has(X86_FEATURE_OSPKE))
+		return false;
+	if (error_code & PF_PK)
+		return true;
+	return false;
+}
+
 static noinline void
 bad_area_access_error(struct pt_regs *regs, unsigned long error_code,
 		      unsigned long address, struct vm_area_struct *vma)
@@ -906,7 +916,7 @@ bad_area_access_error(struct pt_regs *re
 	 * But, doing it this way allows compiler optimizations
 	 * if pkeys are compiled out.
 	 */
-	if (boot_cpu_has(X86_FEATURE_OSPKE) && (error_code & PF_PK))
+	if (bad_area_access_from_pkeys(error_code, vma))
 		__bad_area(regs, error_code, address, vma, SEGV_PKUERR);
 	else
 		__bad_area(regs, error_code, address, vma, SEGV_ACCERR);
@@ -1081,6 +1091,15 @@ int show_unhandled_signals = 1;
 static inline int
 access_error(unsigned long error_code, struct vm_area_struct *vma)
 {
+	/*
+	 * Access or read was blocked by protection keys. We do
+	 * this check before any others because we do not want
+	 * to, for instance, confuse a protection-key-denied
+	 * write with one for which we should do a COW.
+	 */
+	if (error_code & PF_PK)
+		return 1;
+
 	if (error_code & PF_WRITE) {
 		/* write, present and write, not present: */
 		if (unlikely(!(vma->vm_flags & VM_WRITE)))
diff -puN arch/x86/mm/gup.c~pkeys-11-pte-fault arch/x86/mm/gup.c
--- a/arch/x86/mm/gup.c~pkeys-11-pte-fault	2015-12-03 16:21:25.577669088 -0800
+++ b/arch/x86/mm/gup.c	2015-12-03 16:21:25.588669587 -0800
@@ -10,6 +10,7 @@
 #include <linux/highmem.h>
 #include <linux/swap.h>
 
+#include <asm/mmu_context.h>
 #include <asm/pgtable.h>
 
 static inline pte_t gup_get_pte(pte_t *ptep)
@@ -74,6 +75,9 @@ static inline int pte_allows_gup(pte_t p
 		return 0;
 	if (write && !(pte_write(pte)))
 		return 0;
+	/* This one checks memory protection keys. */
+	if (!arch_pte_access_permitted(pte, write))
+		return 0;
 	return 1;
 }
 
diff -puN include/asm-generic/mm_hooks.h~pkeys-11-pte-fault include/asm-generic/mm_hooks.h
--- a/include/asm-generic/mm_hooks.h~pkeys-11-pte-fault	2015-12-03 16:21:25.579669178 -0800
+++ b/include/asm-generic/mm_hooks.h	2015-12-03 16:21:25.588669587 -0800
@@ -26,4 +26,15 @@ static inline void arch_bprm_mm_init(str
 {
 }
 
+static inline bool arch_vma_access_permitted(struct vm_area_struct *vma, bool write)
+{
+	/* by default, allow everything */
+	return true;
+}
+
+static inline bool arch_pte_access_permitted(pte_t pte, bool write)
+{
+	/* by default, allow everything */
+	return true;
+}
 #endif	/* _ASM_GENERIC_MM_HOOKS_H */
diff -puN mm/gup.c~pkeys-11-pte-fault mm/gup.c
--- a/mm/gup.c~pkeys-11-pte-fault	2015-12-03 16:21:25.580669224 -0800
+++ b/mm/gup.c	2015-12-03 16:21:25.589669632 -0800
@@ -13,6 +13,7 @@
 #include <linux/rwsem.h>
 #include <linux/hugetlb.h>
 
+#include <asm/mmu_context.h>
 #include <asm/pgtable.h>
 #include <asm/tlbflush.h>
 
@@ -391,6 +392,8 @@ static int check_vma_flags(struct vm_are
 		if (!(vm_flags & VM_MAYREAD))
 			return -EFAULT;
 	}
+	if (!arch_vma_access_permitted(vma, (gup_flags & FOLL_WRITE)))
+		return -EFAULT;
 	return 0;
 }
 
@@ -559,13 +562,19 @@ EXPORT_SYMBOL(__get_user_pages);
 
 bool vma_permits_fault(struct vm_area_struct *vma, unsigned int fault_flags)
 {
-	vm_flags_t vm_flags;
-
-	vm_flags = (fault_flags & FAULT_FLAG_WRITE) ? VM_WRITE : VM_READ;
+	bool write = !!(fault_flags & FAULT_FLAG_WRITE);
+	vm_flags_t vm_flags = write ? VM_WRITE : VM_READ;
 
 	if (!(vm_flags & vma->vm_flags))
 		return false;
 
+	/*
+	 * The architecture might have a hardware protection
+	 * mechanism other than read/write that can deny access
+	 */
+	if (!arch_vma_access_permitted(vma, write))
+		return false;
+
 	return true;
 }
 
@@ -1102,6 +1111,9 @@ static int gup_pte_range(pmd_t pmd, unsi
 			pte_protnone(pte) || (write && !pte_write(pte)))
 			goto pte_unmap;
 
+		if (!arch_pte_access_permitted(pte, write))
+			goto pte_unmap;
+
 		VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
 		page = pte_page(pte);
 
diff -puN mm/memory.c~pkeys-11-pte-fault mm/memory.c
--- a/mm/memory.c~pkeys-11-pte-fault	2015-12-03 16:21:25.582669314 -0800
+++ b/mm/memory.c	2015-12-03 16:21:25.590669677 -0800
@@ -64,6 +64,7 @@
 #include <linux/userfaultfd_k.h>
 
 #include <asm/io.h>
+#include <asm/mmu_context.h>
 #include <asm/pgalloc.h>
 #include <asm/uaccess.h>
 #include <asm/tlb.h>
@@ -3344,6 +3345,9 @@ static int __handle_mm_fault(struct mm_s
 	pmd_t *pmd;
 	pte_t *pte;
 
+	if (!arch_vma_access_permitted(vma, flags & FAULT_FLAG_WRITE))
+		return VM_FAULT_SIGSEGV;
+
 	if (unlikely(is_vm_hugetlb_page(vma)))
 		return hugetlb_fault(mm, vma, address, flags);
 
_

^ permalink raw reply	[flat|nested] 145+ messages in thread

* [PATCH 18/34] mm: add gup flag to indicate "foreign" mm access
  2015-12-04  1:14 ` Dave Hansen
@ 2015-12-04  1:14   ` Dave Hansen
  -1 siblings, 0 replies; 145+ messages in thread
From: Dave Hansen @ 2015-12-04  1:14 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-mm, x86, Dave Hansen, dave.hansen, linux-arch


From: Dave Hansen <dave.hansen@linux.intel.com>

We try to enforce protection keys in software the same way that we
do in hardware.  (See long example below).

But, we only want to do this when accessing our *own* process's
memory.  If GDB set PKRU[6].AD=1 (disable access to PKEY 6), then
tried to PTRACE_POKE a target process which just happened to have
some mprotect_pkey(pkey=6) memory, we do *not* want to deny the
debugger access to that memory.  PKRU is fundamentally a
thread-local structure and we do not want to enforce it on access
to _another_ thread's data.

This gets especially tricky with workqueues and other delayed-work
mechanisms that might run in a random process's context.  We can
check that we only enforce pkeys when operating on our *own* mm,
but delayed work runs under whatever user context happens to be
active at the time.  A delayed-work gup might then fail when it
happens to run under its "own" task yet succeed when running under
another process.  We want to avoid that inconsistency.

To avoid that, we add a GUP flag: FOLL_FOREIGN and a fault flag:
FAULT_FLAG_FOREIGN.  They indicate that we are walking an mm
which is not guaranteed to be the same as current->mm and should
not be subject to protection key enforcement.

Thanks to Jerome Glisse for pointing out this scenario.
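
To illustrate (condensed from the mm/ksm.c hunk below), a code path
that operates on a possibly-foreign mm now passes the new flags so
that both the page-table walk and any fault it triggers skip
protection key enforcement:

	/* walking an mm that may not be current->mm: */
	page = follow_page(vma, addr,
			FOLL_GET | FOLL_MIGRATION | FOLL_FOREIGN);
	...
	ret = handle_mm_fault(vma->vm_mm, vma, addr,
			FAULT_FLAG_WRITE | FAULT_FLAG_FOREIGN);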

*** Why do we enforce protection keys in software?? ***

Imagine that we disabled access to the memory pointed to by 'buf'.
Then, we implemented sys_read() like this:

	sys_read(fd, buf, len...)
	{
		struct page *page = follow_page(buf);
		void *buf_mapped = kmap(page);
		memcpy(buf_mapped, fd_data, len);
		...
	}

This writes to 'buf' via a *kernel* mapping, without a protection
key, so it succeeds.  This implementation does the same thing:

	sys_read(fd, buf, len...)
	{
		copy_to_user(buf, fd_data, len);
		...
	}

but it would hit a protection key fault because the userspace 'buf'
mapping has a protection key set.

To provide consistency, and to make key-protected memory work
as much like mprotect()ed memory as possible, we try to enforce
the same protections as the hardware would when the *kernel* walks
the page tables (and other mm structures).

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: linux-arch@vger.kernel.org
---

 b/arch/powerpc/include/asm/mmu_context.h   |    3 ++-
 b/arch/s390/include/asm/mmu_context.h      |    3 ++-
 b/arch/unicore32/include/asm/mmu_context.h |    3 ++-
 b/arch/x86/include/asm/mmu_context.h       |    5 +++--
 b/drivers/iommu/amd_iommu_v2.c             |    8 +++++---
 b/include/asm-generic/mm_hooks.h           |    3 ++-
 b/include/linux/mm.h                       |    2 ++
 b/mm/gup.c                                 |   15 ++++++++++-----
 b/mm/ksm.c                                 |   10 ++++++++--
 b/mm/memory.c                              |    3 ++-
 10 files changed, 38 insertions(+), 17 deletions(-)

diff -puN arch/powerpc/include/asm/mmu_context.h~pkeys-12-gup-fault-foreign-flag arch/powerpc/include/asm/mmu_context.h
--- a/arch/powerpc/include/asm/mmu_context.h~pkeys-12-gup-fault-foreign-flag	2015-12-03 16:21:26.223698386 -0800
+++ b/arch/powerpc/include/asm/mmu_context.h	2015-12-03 16:21:26.241699202 -0800
@@ -148,7 +148,8 @@ static inline void arch_bprm_mm_init(str
 {
 }
 
-static inline bool arch_vma_access_permitted(struct vm_area_struct *vma, bool write)
+static inline bool arch_vma_access_permitted(struct vm_area_struct *vma,
+		bool write, bool foreign)
 {
 	/* by default, allow everything */
 	return true;
diff -puN arch/s390/include/asm/mmu_context.h~pkeys-12-gup-fault-foreign-flag arch/s390/include/asm/mmu_context.h
--- a/arch/s390/include/asm/mmu_context.h~pkeys-12-gup-fault-foreign-flag	2015-12-03 16:21:26.224698431 -0800
+++ b/arch/s390/include/asm/mmu_context.h	2015-12-03 16:21:26.242699248 -0800
@@ -130,7 +130,8 @@ static inline void arch_bprm_mm_init(str
 {
 }
 
-static inline bool arch_vma_access_permitted(struct vm_area_struct *vma, bool write)
+static inline bool arch_vma_access_permitted(struct vm_area_struct *vma,
+		bool write, bool foreign)
 {
 	/* by default, allow everything */
 	return true;
diff -puN arch/unicore32/include/asm/mmu_context.h~pkeys-12-gup-fault-foreign-flag arch/unicore32/include/asm/mmu_context.h
--- a/arch/unicore32/include/asm/mmu_context.h~pkeys-12-gup-fault-foreign-flag	2015-12-03 16:21:26.226698522 -0800
+++ b/arch/unicore32/include/asm/mmu_context.h	2015-12-03 16:21:26.242699248 -0800
@@ -97,7 +97,8 @@ static inline void arch_bprm_mm_init(str
 {
 }
 
-static inline bool arch_vma_access_permitted(struct vm_area_struct *vma, bool write)
+static inline bool arch_vma_access_permitted(struct vm_area_struct *vma,
+		bool write, bool foreign)
 {
 	/* by default, allow everything */
 	return true;
diff -puN arch/x86/include/asm/mmu_context.h~pkeys-12-gup-fault-foreign-flag arch/x86/include/asm/mmu_context.h
--- a/arch/x86/include/asm/mmu_context.h~pkeys-12-gup-fault-foreign-flag	2015-12-03 16:21:26.228698613 -0800
+++ b/arch/x86/include/asm/mmu_context.h	2015-12-03 16:21:26.242699248 -0800
@@ -299,10 +299,11 @@ static inline bool vma_is_foreign(struct
 	return false;
 }
 
-static inline bool arch_vma_access_permitted(struct vm_area_struct *vma, bool write)
+static inline bool arch_vma_access_permitted(struct vm_area_struct *vma,
+		bool write, bool foreign)
 {
 	/* allow access if the VMA is not one from this process */
-	if (vma_is_foreign(vma))
+	if (foreign || vma_is_foreign(vma))
 		return true;
 	return __pkru_allows_pkey(vma_pkey(vma), write);
 }
diff -puN drivers/iommu/amd_iommu_v2.c~pkeys-12-gup-fault-foreign-flag drivers/iommu/amd_iommu_v2.c
--- a/drivers/iommu/amd_iommu_v2.c~pkeys-12-gup-fault-foreign-flag	2015-12-03 16:21:26.229698658 -0800
+++ b/drivers/iommu/amd_iommu_v2.c	2015-12-03 16:21:26.243699293 -0800
@@ -500,9 +500,11 @@ static void do_fault(struct work_struct
 	struct mm_struct *mm;
 	struct vm_area_struct *vma;
 	u64 address;
-	int ret, write;
+	int ret, flags = 0;
 
-	write = !!(fault->flags & PPR_FAULT_WRITE);
+	if (fault->flags & PPR_FAULT_WRITE)
+		flags = FAULT_FLAG_WRITE;
+	flags |= FAULT_FLAG_FOREIGN;
 
 	mm = fault->state->mm;
 	address = fault->address;
@@ -523,7 +525,7 @@ static void do_fault(struct work_struct
 		goto out;
 	}
 
-	ret = handle_mm_fault(mm, vma, address, write);
+	ret = handle_mm_fault(mm, vma, address, flags);
 	if (ret & VM_FAULT_ERROR) {
 		/* failed to service fault */
 		up_read(&mm->mmap_sem);
diff -puN include/asm-generic/mm_hooks.h~pkeys-12-gup-fault-foreign-flag include/asm-generic/mm_hooks.h
--- a/include/asm-generic/mm_hooks.h~pkeys-12-gup-fault-foreign-flag	2015-12-03 16:21:26.231698749 -0800
+++ b/include/asm-generic/mm_hooks.h	2015-12-03 16:21:26.243699293 -0800
@@ -26,7 +26,8 @@ static inline void arch_bprm_mm_init(str
 {
 }
 
-static inline bool arch_vma_access_permitted(struct vm_area_struct *vma, bool write)
+static inline bool arch_vma_access_permitted(struct vm_area_struct *vma,
+		bool write, bool foreign)
 {
 	/* by default, allow everything */
 	return true;
diff -puN include/linux/mm.h~pkeys-12-gup-fault-foreign-flag include/linux/mm.h
--- a/include/linux/mm.h~pkeys-12-gup-fault-foreign-flag	2015-12-03 16:21:26.233698839 -0800
+++ b/include/linux/mm.h	2015-12-03 16:21:26.244699338 -0800
@@ -232,6 +232,7 @@ extern pgprot_t protection_map[16];
 #define FAULT_FLAG_KILLABLE	0x10	/* The fault task is in SIGKILL killable region */
 #define FAULT_FLAG_TRIED	0x20	/* Second try */
 #define FAULT_FLAG_USER		0x40	/* The fault originated in userspace */
+#define FAULT_FLAG_FOREIGN	0x80	/* faulting for non current tsk/mm */
 
 /*
  * vm_fault is filled by the the pagefault handler and passed to the vma's
@@ -2138,6 +2139,7 @@ static inline struct page *follow_page(s
 #define FOLL_MIGRATION	0x400	/* wait for page to replace migration entry */
 #define FOLL_TRIED	0x800	/* a retry, previous pass started an IO */
 #define FOLL_MLOCK	0x1000	/* lock present pages */
+#define FOLL_FOREIGN	0x2000	/* we are working on non-current tsk/mm */
 
 typedef int (*pte_fn_t)(pte_t *pte, pgtable_t token, unsigned long addr,
 			void *data);
diff -puN mm/gup.c~pkeys-12-gup-fault-foreign-flag mm/gup.c
--- a/mm/gup.c~pkeys-12-gup-fault-foreign-flag	2015-12-03 16:21:26.234698885 -0800
+++ b/mm/gup.c	2015-12-03 16:21:26.245699384 -0800
@@ -310,6 +310,8 @@ static int faultin_page(struct task_stru
 		return -ENOENT;
 	if (*flags & FOLL_WRITE)
 		fault_flags |= FAULT_FLAG_WRITE;
+	if (*flags & FOLL_FOREIGN)
+		fault_flags |= FAULT_FLAG_FOREIGN;
 	if (nonblocking)
 		fault_flags |= FAULT_FLAG_ALLOW_RETRY;
 	if (*flags & FOLL_NOWAIT)
@@ -360,11 +362,13 @@ static int faultin_page(struct task_stru
 static int check_vma_flags(struct vm_area_struct *vma, unsigned long gup_flags)
 {
 	vm_flags_t vm_flags = vma->vm_flags;
+	int write = (gup_flags & FOLL_WRITE);
+	int foreign = (gup_flags & FOLL_FOREIGN);
 
 	if (vm_flags & (VM_IO | VM_PFNMAP))
 		return -EFAULT;
 
-	if (gup_flags & FOLL_WRITE) {
+	if (write) {
 		if (!(vm_flags & VM_WRITE)) {
 			if (!(gup_flags & FOLL_FORCE))
 				return -EFAULT;
@@ -392,7 +396,7 @@ static int check_vma_flags(struct vm_are
 		if (!(vm_flags & VM_MAYREAD))
 			return -EFAULT;
 	}
-	if (!arch_vma_access_permitted(vma, (gup_flags & FOLL_WRITE)))
+	if (!arch_vma_access_permitted(vma, write, foreign))
 		return -EFAULT;
 	return 0;
 }
@@ -562,7 +566,8 @@ EXPORT_SYMBOL(__get_user_pages);
 
 bool vma_permits_fault(struct vm_area_struct *vma, unsigned int fault_flags)
 {
-	bool write = !!(fault_flags & FAULT_FLAG_WRITE);
+	bool write   = !!(fault_flags & FAULT_FLAG_WRITE);
+	bool foreign = !!(fault_flags & FAULT_FLAG_FOREIGN);
 	vm_flags_t vm_flags = write ? VM_WRITE : VM_READ;
 
 	if (!(vm_flags & vma->vm_flags))
@@ -570,9 +575,9 @@ bool vma_permits_fault(struct vm_area_st
 
 	/*
 	 * The architecture might have a hardware protection
-	 * mechanism other than read/write that can deny access
+	 * mechanism other than read/write that can deny access.
 	 */
-	if (!arch_vma_access_permitted(vma, write))
+	if (!arch_vma_access_permitted(vma, write, foreign))
 		return false;
 
 	return true;
diff -puN mm/ksm.c~pkeys-12-gup-fault-foreign-flag mm/ksm.c
--- a/mm/ksm.c~pkeys-12-gup-fault-foreign-flag	2015-12-03 16:21:26.236698975 -0800
+++ b/mm/ksm.c	2015-12-03 16:21:26.246699429 -0800
@@ -359,6 +359,10 @@ static inline bool ksm_test_exit(struct
  * in case the application has unmapped and remapped mm,addr meanwhile.
  * Could a ksm page appear anywhere else?  Actually yes, in a VM_PFNMAP
  * mmap of /dev/mem or /dev/kmem, where we would not want to touch it.
+ *
+ * FAULT_FLAG/FOLL_FOREIGN are because we do this outside the context
+ * of the process that owns 'vma'.  We also do not want to enforce
+ * protection keys here anyway.
  */
 static int break_ksm(struct vm_area_struct *vma, unsigned long addr)
 {
@@ -367,12 +371,14 @@ static int break_ksm(struct vm_area_stru
 
 	do {
 		cond_resched();
-		page = follow_page(vma, addr, FOLL_GET | FOLL_MIGRATION);
+		page = follow_page(vma, addr,
+				FOLL_GET | FOLL_MIGRATION | FOLL_FOREIGN);
 		if (IS_ERR_OR_NULL(page))
 			break;
 		if (PageKsm(page))
 			ret = handle_mm_fault(vma->vm_mm, vma, addr,
-							FAULT_FLAG_WRITE);
+							FAULT_FLAG_WRITE |
+							FAULT_FLAG_FOREIGN);
 		else
 			ret = VM_FAULT_WRITE;
 		put_page(page);
diff -puN mm/memory.c~pkeys-12-gup-fault-foreign-flag mm/memory.c
--- a/mm/memory.c~pkeys-12-gup-fault-foreign-flag	2015-12-03 16:21:26.238699066 -0800
+++ b/mm/memory.c	2015-12-03 16:21:26.247699474 -0800
@@ -3345,7 +3345,8 @@ static int __handle_mm_fault(struct mm_s
 	pmd_t *pmd;
 	pte_t *pte;
 
-	if (!arch_vma_access_permitted(vma, flags & FAULT_FLAG_WRITE))
+	if (!arch_vma_access_permitted(vma, flags & FAULT_FLAG_WRITE,
+					    flags & FAULT_FLAG_FOREIGN))
 		return VM_FAULT_SIGSEGV;
 
 	if (unlikely(is_vm_hugetlb_page(vma)))
_

^ permalink raw reply	[flat|nested] 145+ messages in thread

* [PATCH 19/34] x86, pkeys: optimize fault handling in access_error()
  2015-12-04  1:14 ` Dave Hansen
@ 2015-12-04  1:14   ` Dave Hansen
  -1 siblings, 0 replies; 145+ messages in thread
From: Dave Hansen @ 2015-12-04  1:14 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-mm, x86, Dave Hansen, dave.hansen


From: Dave Hansen <dave.hansen@linux.intel.com>

We might not strictly have to make modifications to
access_error() to check the VMA here.

If we do not, we will do this:
1. app sets VMA pkey to K
2. app touches a !present page
3. do_page_fault() allocates and maps page, sets pte.pkey=K
4. return to userspace
5. touch instruction reexecutes, but triggers PF_PK
6. do PKEY signal

What happens with this patch applied:
1. app sets VMA pkey to K
2. app touches a !present page
3. do_page_fault() notices that K is inaccessible
4. do PKEY signal

We basically skip the fault that does an allocation.

So what this lets us do is protect areas from even being
*populated* unless they are accessible according to protection
keys.  That seems handy to me and makes protection keys work
more like an mprotect()'d mapping.
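
For clarity, this is roughly what access_error() looks like with
the patch applied (a condensed view; the unchanged read/write
checks from the hunk below are elided):

	static inline int
	access_error(unsigned long error_code, struct vm_area_struct *vma)
	{
		/* This is only called for the current mm, so: */
		int foreign = 0;

		/* the hardware already reported a pkey violation: */
		if (error_code & PF_PK)
			return 1;
		/* check the VMA's pkey even on not-present faults: */
		if (!arch_vma_access_permitted(vma, (error_code & PF_WRITE),
					       foreign))
			return 1;
		/* ... existing PF_WRITE/VM_WRITE checks ... */
	}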

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
---

 b/arch/x86/mm/fault.c |   15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff -puN arch/x86/mm/fault.c~pkeys-15-access_error arch/x86/mm/fault.c
--- a/arch/x86/mm/fault.c~pkeys-15-access_error	2015-12-03 16:21:26.872727820 -0800
+++ b/arch/x86/mm/fault.c	2015-12-03 16:21:26.876728002 -0800
@@ -900,10 +900,16 @@ bad_area(struct pt_regs *regs, unsigned
 static inline bool bad_area_access_from_pkeys(unsigned long error_code,
 		struct vm_area_struct *vma)
 {
+	/* This code is always called on the current mm */
+	int foreign = 0;
+
 	if (!boot_cpu_has(X86_FEATURE_OSPKE))
 		return false;
 	if (error_code & PF_PK)
 		return true;
+	/* this checks permission keys on the VMA: */
+	if (!arch_vma_access_permitted(vma, (error_code & PF_WRITE), foreign))
+		return true;
 	return false;
 }
 
@@ -1091,6 +1097,8 @@ int show_unhandled_signals = 1;
 static inline int
 access_error(unsigned long error_code, struct vm_area_struct *vma)
 {
+	/* This is only called for the current mm, so: */
+	int foreign = 0;
 	/*
 	 * Access or read was blocked by protection keys. We do
 	 * this check before any others because we do not want
@@ -1099,6 +1107,13 @@ access_error(unsigned long error_code, s
 	 */
 	if (error_code & PF_PK)
 		return 1;
+	/*
+	 * Make sure to check the VMA so that we do not perform
+	 * faults just to hit a PF_PK as soon as we fill in a
+	 * page.
+	 */
+	if (!arch_vma_access_permitted(vma, (error_code & PF_WRITE), foreign))
+		return 1;
 
 	if (error_code & PF_WRITE) {
 		/* write, present and write, not present: */
_

^ permalink raw reply	[flat|nested] 145+ messages in thread

* [PATCH 20/34] x86, pkeys: differentiate instruction fetches
  2015-12-04  1:14 ` Dave Hansen
@ 2015-12-04  1:14   ` Dave Hansen
  -1 siblings, 0 replies; 145+ messages in thread
From: Dave Hansen @ 2015-12-04  1:14 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-mm, x86, Dave Hansen, dave.hansen


From: Dave Hansen <dave.hansen@linux.intel.com>

As discussed earlier, we attempt to enforce protection keys in
software.

However, the code checks all faults to ensure that they are not
violating protection key permissions.  It was assumed that all
faults are either write faults where we check PKRU[key].WD (write
disable) or read faults where we check the AD (access disable)
bit.

But, there is a third category of faults for protection keys:
instruction faults.  Instruction faults never run afoul of
protection keys because protection keys do not affect instruction
fetches.

So, plumb the PF_INSTR bit down into the
arch_vma_access_permitted() function where we do the protection
key checks.

We also add a new FAULT_FLAG_INSTRUCTION.  This is because
handle_mm_fault() is not passed the architecture-specific
error_code where we keep PF_INSTR, so we need to encode the
instruction fetch information into the arch-generic fault
flags.
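
As a sketch of the resulting behavior (assuming the pkey management
calls from elsewhere in this series; set_pkey() and wrpkru() are
placeholder names): a region whose key has access-disable set can
still be executed, which is what makes execute-only memory possible:

	set_pkey(code, PAGE_SIZE, 4);	/* tag the code page with pkey 4 */
	wrpkru(1U << (4 * 2));		/* access-disable key 4 */
	((void (*)(void))code)();	/* instruction fetch: still permitted */
	n = *(int *)code;		/* data read: SEGV_PKUERR */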

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
---

 b/arch/powerpc/include/asm/mmu_context.h |    2 +-
 b/arch/s390/include/asm/mmu_context.h    |    2 +-
 b/arch/x86/include/asm/mmu_context.h     |    5 ++++-
 b/arch/x86/mm/fault.c                    |    8 ++++++--
 b/include/asm-generic/mm_hooks.h         |    2 +-
 b/include/linux/mm.h                     |    1 +
 b/mm/gup.c                               |   11 +++++++++--
 b/mm/memory.c                            |    1 +
 8 files changed, 24 insertions(+), 8 deletions(-)

diff -puN arch/powerpc/include/asm/mmu_context.h~pkeys-allow-execute-on-unreadable arch/powerpc/include/asm/mmu_context.h
--- a/arch/powerpc/include/asm/mmu_context.h~pkeys-allow-execute-on-unreadable	2015-12-03 16:21:27.286746596 -0800
+++ b/arch/powerpc/include/asm/mmu_context.h	2015-12-03 16:21:27.301747276 -0800
@@ -149,7 +149,7 @@ static inline void arch_bprm_mm_init(str
 }
 
 static inline bool arch_vma_access_permitted(struct vm_area_struct *vma,
-		bool write, bool foreign)
+		bool write, bool execute, bool foreign)
 {
 	/* by default, allow everything */
 	return true;
diff -puN arch/s390/include/asm/mmu_context.h~pkeys-allow-execute-on-unreadable arch/s390/include/asm/mmu_context.h
--- a/arch/s390/include/asm/mmu_context.h~pkeys-allow-execute-on-unreadable	2015-12-03 16:21:27.288746687 -0800
+++ b/arch/s390/include/asm/mmu_context.h	2015-12-03 16:21:27.302747322 -0800
@@ -131,7 +131,7 @@ static inline void arch_bprm_mm_init(str
 }
 
 static inline bool arch_vma_access_permitted(struct vm_area_struct *vma,
-		bool write, bool foreign)
+		bool write, bool execute, bool foreign)
 {
 	/* by default, allow everything */
 	return true;
diff -puN arch/x86/include/asm/mmu_context.h~pkeys-allow-execute-on-unreadable arch/x86/include/asm/mmu_context.h
--- a/arch/x86/include/asm/mmu_context.h~pkeys-allow-execute-on-unreadable	2015-12-03 16:21:27.289746732 -0800
+++ b/arch/x86/include/asm/mmu_context.h	2015-12-03 16:21:27.302747322 -0800
@@ -300,8 +300,11 @@ static inline bool vma_is_foreign(struct
 }
 
 static inline bool arch_vma_access_permitted(struct vm_area_struct *vma,
-		bool write, bool foreign)
+		bool write, bool execute, bool foreign)
 {
+	/* pkeys never affect instruction fetches */
+	if (execute)
+		return true;
 	/* allow access if the VMA is not one from this process */
 	if (foreign || vma_is_foreign(vma))
 		return true;
diff -puN arch/x86/mm/fault.c~pkeys-allow-execute-on-unreadable arch/x86/mm/fault.c
--- a/arch/x86/mm/fault.c~pkeys-allow-execute-on-unreadable	2015-12-03 16:21:27.291746823 -0800
+++ b/arch/x86/mm/fault.c	2015-12-03 16:21:27.303747367 -0800
@@ -908,7 +908,8 @@ static inline bool bad_area_access_from_
 	if (error_code & PF_PK)
 		return true;
 	/* this checks permission keys on the VMA: */
-	if (!arch_vma_access_permitted(vma, (error_code & PF_WRITE), foreign))
+	if (!arch_vma_access_permitted(vma, (error_code & PF_WRITE),
+				(error_code & PF_INSTR), foreign))
 		return true;
 	return false;
 }
@@ -1112,7 +1113,8 @@ access_error(unsigned long error_code, s
 	 * faults just to hit a PF_PK as soon as we fill in a
 	 * page.
 	 */
-	if (!arch_vma_access_permitted(vma, (error_code & PF_WRITE), foreign))
+	if (!arch_vma_access_permitted(vma, (error_code & PF_WRITE),
+				(error_code & PF_INSTR), foreign))
 		return 1;
 
 	if (error_code & PF_WRITE) {
@@ -1267,6 +1269,8 @@ __do_page_fault(struct pt_regs *regs, un
 
 	if (error_code & PF_WRITE)
 		flags |= FAULT_FLAG_WRITE;
+	if (error_code & PF_INSTR)
+		flags |= FAULT_FLAG_INSTRUCTION;
 
 	/*
 	 * When running in the kernel we expect faults to occur only to
diff -puN include/asm-generic/mm_hooks.h~pkeys-allow-execute-on-unreadable include/asm-generic/mm_hooks.h
--- a/include/asm-generic/mm_hooks.h~pkeys-allow-execute-on-unreadable	2015-12-03 16:21:27.293746914 -0800
+++ b/include/asm-generic/mm_hooks.h	2015-12-03 16:21:27.303747367 -0800
@@ -27,7 +27,7 @@ static inline void arch_bprm_mm_init(str
 }
 
 static inline bool arch_vma_access_permitted(struct vm_area_struct *vma,
-		bool write, bool foreign)
+		bool write, bool execute, bool foreign)
 {
 	/* by default, allow everything */
 	return true;
diff -puN include/linux/mm.h~pkeys-allow-execute-on-unreadable include/linux/mm.h
--- a/include/linux/mm.h~pkeys-allow-execute-on-unreadable	2015-12-03 16:21:27.294746959 -0800
+++ b/include/linux/mm.h	2015-12-03 16:21:27.304747413 -0800
@@ -233,6 +233,7 @@ extern pgprot_t protection_map[16];
 #define FAULT_FLAG_TRIED	0x20	/* Second try */
 #define FAULT_FLAG_USER		0x40	/* The fault originated in userspace */
 #define FAULT_FLAG_FOREIGN	0x80	/* faulting for non current tsk/mm */
+#define FAULT_FLAG_INSTRUCTION  0x100	/* The fault was during an instruction fetch */
 
 /*
  * vm_fault is filled by the the pagefault handler and passed to the vma's
diff -puN mm/gup.c~pkeys-allow-execute-on-unreadable mm/gup.c
--- a/mm/gup.c~pkeys-allow-execute-on-unreadable	2015-12-03 16:21:27.296747050 -0800
+++ b/mm/gup.c	2015-12-03 16:21:27.304747413 -0800
@@ -396,7 +396,11 @@ static int check_vma_flags(struct vm_are
 		if (!(vm_flags & VM_MAYREAD))
 			return -EFAULT;
 	}
-	if (!arch_vma_access_permitted(vma, write, foreign))
+	/*
+	 * gups are always data accesses, not instruction
+	 * fetches, so execute=0 here
+	 */
+	if (!arch_vma_access_permitted(vma, write, 0, foreign))
 		return -EFAULT;
 	return 0;
 }
@@ -576,8 +580,11 @@ bool vma_permits_fault(struct vm_area_st
 	/*
 	 * The architecture might have a hardware protection
 	 * mechanism other than read/write that can deny access.
+	 *
+	 * gup always represents data access, not instruction
+	 * fetches, so execute=0 here:
 	 */
-	if (!arch_vma_access_permitted(vma, write, foreign))
+	if (!arch_vma_access_permitted(vma, write, 0, foreign))
 		return false;
 
 	return true;
diff -puN mm/memory.c~pkeys-allow-execute-on-unreadable mm/memory.c
--- a/mm/memory.c~pkeys-allow-execute-on-unreadable	2015-12-03 16:21:27.298747141 -0800
+++ b/mm/memory.c	2015-12-03 16:21:27.306747503 -0800
@@ -3346,6 +3346,7 @@ static int __handle_mm_fault(struct mm_s
 	pte_t *pte;
 
 	if (!arch_vma_access_permitted(vma, flags & FAULT_FLAG_WRITE,
+					    flags & FAULT_FLAG_INSTRUCTION,
 					    flags & FAULT_FLAG_FOREIGN))
 		return VM_FAULT_SIGSEGV;
 
_

^ permalink raw reply	[flat|nested] 145+ messages in thread

* [PATCH 20/34] x86, pkeys: differentiate instruction fetches
@ 2015-12-04  1:14   ` Dave Hansen
  0 siblings, 0 replies; 145+ messages in thread
From: Dave Hansen @ 2015-12-04  1:14 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-mm, x86, Dave Hansen, dave.hansen


From: Dave Hansen <dave.hansen@linux.intel.com>

As discussed earlier, we attempt to enforce protection keys in
software.

However, the code checks all faults to ensure that they are not
violating protection key permissions.  It was assumed that all
faults are either write faults where we check PKRU[key].WD (write
disable) or read faults where we check the AD (access disable)
bit.

But, there is a third category of faults for protection keys:
instruction faults.  Instruction faults never run afoul of
protection keys because they do not affect instruction fetches.

So, plumb the PF_INSTR bit down in to the
arch_vma_access_permitted() function where we do the protection
key checks.

We also add a new FAULT_FLAG_INSTRUCTION.  This is because
handle_mm_fault() is not passed the architecture-specific
error_code where we keep PF_INSTR, so we need to encode the
instruction fetch information in to the arch-generic fault
flags.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
---

 b/arch/powerpc/include/asm/mmu_context.h |    2 +-
 b/arch/s390/include/asm/mmu_context.h    |    2 +-
 b/arch/x86/include/asm/mmu_context.h     |    5 ++++-
 b/arch/x86/mm/fault.c                    |    8 ++++++--
 b/include/asm-generic/mm_hooks.h         |    2 +-
 b/include/linux/mm.h                     |    1 +
 b/mm/gup.c                               |   11 +++++++++--
 b/mm/memory.c                            |    1 +
 8 files changed, 24 insertions(+), 8 deletions(-)

diff -puN arch/powerpc/include/asm/mmu_context.h~pkeys-allow-execute-on-unreadable arch/powerpc/include/asm/mmu_context.h
--- a/arch/powerpc/include/asm/mmu_context.h~pkeys-allow-execute-on-unreadable	2015-12-03 16:21:27.286746596 -0800
+++ b/arch/powerpc/include/asm/mmu_context.h	2015-12-03 16:21:27.301747276 -0800
@@ -149,7 +149,7 @@ static inline void arch_bprm_mm_init(str
 }
 
 static inline bool arch_vma_access_permitted(struct vm_area_struct *vma,
-		bool write, bool foreign)
+		bool write, bool execute, bool foreign)
 {
 	/* by default, allow everything */
 	return true;
diff -puN arch/s390/include/asm/mmu_context.h~pkeys-allow-execute-on-unreadable arch/s390/include/asm/mmu_context.h
--- a/arch/s390/include/asm/mmu_context.h~pkeys-allow-execute-on-unreadable	2015-12-03 16:21:27.288746687 -0800
+++ b/arch/s390/include/asm/mmu_context.h	2015-12-03 16:21:27.302747322 -0800
@@ -131,7 +131,7 @@ static inline void arch_bprm_mm_init(str
 }
 
 static inline bool arch_vma_access_permitted(struct vm_area_struct *vma,
-		bool write, bool foreign)
+		bool write, bool execute, bool foreign)
 {
 	/* by default, allow everything */
 	return true;
diff -puN arch/x86/include/asm/mmu_context.h~pkeys-allow-execute-on-unreadable arch/x86/include/asm/mmu_context.h
--- a/arch/x86/include/asm/mmu_context.h~pkeys-allow-execute-on-unreadable	2015-12-03 16:21:27.289746732 -0800
+++ b/arch/x86/include/asm/mmu_context.h	2015-12-03 16:21:27.302747322 -0800
@@ -300,8 +300,11 @@ static inline bool vma_is_foreign(struct
 }
 
 static inline bool arch_vma_access_permitted(struct vm_area_struct *vma,
-		bool write, bool foreign)
+		bool write, bool execute, bool foreign)
 {
+	/* pkeys never affect instruction fetches */
+	if (execute)
+		return true;
 	/* allow access if the VMA is not one from this process */
 	if (foreign || vma_is_foreign(vma))
 		return true;
diff -puN arch/x86/mm/fault.c~pkeys-allow-execute-on-unreadable arch/x86/mm/fault.c
--- a/arch/x86/mm/fault.c~pkeys-allow-execute-on-unreadable	2015-12-03 16:21:27.291746823 -0800
+++ b/arch/x86/mm/fault.c	2015-12-03 16:21:27.303747367 -0800
@@ -908,7 +908,8 @@ static inline bool bad_area_access_from_
 	if (error_code & PF_PK)
 		return true;
 	/* this checks permission keys on the VMA: */
-	if (!arch_vma_access_permitted(vma, (error_code & PF_WRITE), foreign))
+	if (!arch_vma_access_permitted(vma, (error_code & PF_WRITE),
+				(error_code & PF_INSTR), foreign))
 		return true;
 	return false;
 }
@@ -1112,7 +1113,8 @@ access_error(unsigned long error_code, s
 	 * faults just to hit a PF_PK as soon as we fill in a
 	 * page.
 	 */
-	if (!arch_vma_access_permitted(vma, (error_code & PF_WRITE), foreign))
+	if (!arch_vma_access_permitted(vma, (error_code & PF_WRITE),
+				(error_code & PF_INSTR), foreign))
 		return 1;
 
 	if (error_code & PF_WRITE) {
@@ -1267,6 +1269,8 @@ __do_page_fault(struct pt_regs *regs, un
 
 	if (error_code & PF_WRITE)
 		flags |= FAULT_FLAG_WRITE;
+	if (error_code & PF_INSTR)
+		flags |= FAULT_FLAG_INSTRUCTION;
 
 	/*
 	 * When running in the kernel we expect faults to occur only to
diff -puN include/asm-generic/mm_hooks.h~pkeys-allow-execute-on-unreadable include/asm-generic/mm_hooks.h
--- a/include/asm-generic/mm_hooks.h~pkeys-allow-execute-on-unreadable	2015-12-03 16:21:27.293746914 -0800
+++ b/include/asm-generic/mm_hooks.h	2015-12-03 16:21:27.303747367 -0800
@@ -27,7 +27,7 @@ static inline void arch_bprm_mm_init(str
 }
 
 static inline bool arch_vma_access_permitted(struct vm_area_struct *vma,
-		bool write, bool foreign)
+		bool write, bool execute, bool foreign)
 {
 	/* by default, allow everything */
 	return true;
diff -puN include/linux/mm.h~pkeys-allow-execute-on-unreadable include/linux/mm.h
--- a/include/linux/mm.h~pkeys-allow-execute-on-unreadable	2015-12-03 16:21:27.294746959 -0800
+++ b/include/linux/mm.h	2015-12-03 16:21:27.304747413 -0800
@@ -233,6 +233,7 @@ extern pgprot_t protection_map[16];
 #define FAULT_FLAG_TRIED	0x20	/* Second try */
 #define FAULT_FLAG_USER		0x40	/* The fault originated in userspace */
 #define FAULT_FLAG_FOREIGN	0x80	/* faulting for non current tsk/mm */
+#define FAULT_FLAG_INSTRUCTION  0x100	/* The fault was during an instruction fetch */
 
 /*
  * vm_fault is filled by the the pagefault handler and passed to the vma's
diff -puN mm/gup.c~pkeys-allow-execute-on-unreadable mm/gup.c
--- a/mm/gup.c~pkeys-allow-execute-on-unreadable	2015-12-03 16:21:27.296747050 -0800
+++ b/mm/gup.c	2015-12-03 16:21:27.304747413 -0800
@@ -396,7 +396,11 @@ static int check_vma_flags(struct vm_are
 		if (!(vm_flags & VM_MAYREAD))
 			return -EFAULT;
 	}
-	if (!arch_vma_access_permitted(vma, write, foreign))
+	/*
+	 * gups are always data accesses, not instruction
+	 * fetches, so execute=0 here
+	 */
+	if (!arch_vma_access_permitted(vma, write, 0, foreign))
 		return -EFAULT;
 	return 0;
 }
@@ -576,8 +580,11 @@ bool vma_permits_fault(struct vm_area_st
 	/*
 	 * The architecture might have a hardware protection
 	 * mechanism other than read/write that can deny access.
+	 *
+	 * gup always represents data access, not instruction
+	 * fetches, so execute=0 here:
 	 */
-	if (!arch_vma_access_permitted(vma, write, foreign))
+	if (!arch_vma_access_permitted(vma, write, 0, foreign))
 		return false;
 
 	return true;
diff -puN mm/memory.c~pkeys-allow-execute-on-unreadable mm/memory.c
--- a/mm/memory.c~pkeys-allow-execute-on-unreadable	2015-12-03 16:21:27.298747141 -0800
+++ b/mm/memory.c	2015-12-03 16:21:27.306747503 -0800
@@ -3346,6 +3346,7 @@ static int __handle_mm_fault(struct mm_s
 	pte_t *pte;
 
 	if (!arch_vma_access_permitted(vma, flags & FAULT_FLAG_WRITE,
+					    flags & FAULT_FLAG_INSTRUCTION,
 					    flags & FAULT_FLAG_FOREIGN))
 		return VM_FAULT_SIGSEGV;
 
_

^ permalink raw reply	[flat|nested] 145+ messages in thread

* [PATCH 21/34] x86, pkeys: dump PKRU with other kernel registers
  2015-12-04  1:14 ` Dave Hansen
@ 2015-12-04  1:14   ` Dave Hansen
  -1 siblings, 0 replies; 145+ messages in thread
From: Dave Hansen @ 2015-12-04  1:14 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-mm, x86, Dave Hansen, dave.hansen


From: Dave Hansen <dave.hansen@linux.intel.com>

I'm a bit ambivalent about whether this is needed.

Protection Keys never affect kernel mappings.  They can,
however, affect whether the kernel will fault when it touches a
user mapping.  The kernel doesn't touch user mappings without
some careful choreography, though, and those accesses don't
generally result in oopses.
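
For reference, here is a minimal sketch of what the series'
read_pkru() helper boils down to.  The hand-encoded instruction
bytes are an assumption for old assemblers; RDPKRU requires
ECX=0 and returns the 32-bit PKRU value in EAX while clearing
EDX:

	/* illustrative only; the real helper lives elsewhere in this series */
	static inline unsigned int rdpkru_sketch(void)
	{
		unsigned int eax, edx;
		unsigned int ecx = 0;	/* RDPKRU requires ECX=0 */

		asm volatile(".byte 0x0f,0x01,0xee\n\t"	/* rdpkru */
			     : "=a" (eax), "=d" (edx)
			     : "c" (ecx));
		return eax;
	}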

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
---

 b/arch/x86/kernel/process_64.c |    2 ++
 1 file changed, 2 insertions(+)

diff -puN arch/x86/kernel/process_64.c~pkeys-30-kernel-error-dumps arch/x86/kernel/process_64.c
--- a/arch/x86/kernel/process_64.c~pkeys-30-kernel-error-dumps	2015-12-03 16:21:27.874773264 -0800
+++ b/arch/x86/kernel/process_64.c	2015-12-03 16:21:27.877773400 -0800
@@ -116,6 +116,8 @@ void __show_regs(struct pt_regs *regs, i
 	printk(KERN_DEFAULT "DR0: %016lx DR1: %016lx DR2: %016lx\n", d0, d1, d2);
 	printk(KERN_DEFAULT "DR3: %016lx DR6: %016lx DR7: %016lx\n", d3, d6, d7);
 
+	if (boot_cpu_has(X86_FEATURE_OSPKE))
+		printk(KERN_DEFAULT "PKRU: %08x\n", read_pkru());
 }
 
 void release_thread(struct task_struct *dead_task)
_

^ permalink raw reply	[flat|nested] 145+ messages in thread

* [PATCH 22/34] x86, pkeys: dump PTE pkey in /proc/pid/smaps
  2015-12-04  1:14 ` Dave Hansen
@ 2015-12-04  1:14   ` Dave Hansen
  -1 siblings, 0 replies; 145+ messages in thread
From: Dave Hansen @ 2015-12-04  1:14 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-mm, x86, Dave Hansen, dave.hansen


From: Dave Hansen <dave.hansen@linux.intel.com>

The protection key can now be just as important as read/write
permissions on a VMA.  We need some debug mechanism to help
figure out if it is in play.  smaps seems like a logical
place to expose it.

arch/x86/kernel/setup.c is a bit of a weirdo place to put
this code, but it already had seq_file.h and there was no
obviously better existing place to put it.

We also use no #ifdef.  If protection keys are .config'd out,
the cpufeature check makes this function behave just like the
empty weak generic version.
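
As a usage sketch (not part of this patch), a small program can
pick the new field out of smaps.  Only the "ProtectionKey:" name
and its "%8u" formatting come from the seq_printf() below; the
rest is illustrative:

	#include <stdio.h>
	#include <string.h>

	int main(void)
	{
		char line[256];
		FILE *f = fopen("/proc/self/smaps", "r");

		if (!f)
			return 1;
		while (fgets(line, sizeof(line), f))
			if (!strncmp(line, "ProtectionKey:", 14))
				fputs(line, stdout);
		fclose(f);
		return 0;
	}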

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
---

 b/arch/x86/kernel/setup.c |    9 +++++++++
 b/fs/proc/task_mmu.c      |    5 +++++
 2 files changed, 14 insertions(+)

diff -puN arch/x86/kernel/setup.c~pkeys-40-smaps arch/x86/kernel/setup.c
--- a/arch/x86/kernel/setup.c~pkeys-40-smaps	2015-12-03 16:21:28.284791859 -0800
+++ b/arch/x86/kernel/setup.c	2015-12-03 16:21:28.289792086 -0800
@@ -112,6 +112,7 @@
 #include <asm/alternative.h>
 #include <asm/prom.h>
 #include <asm/microcode.h>
+#include <asm/mmu_context.h>
 
 /*
  * max_low_pfn_mapped: highest direct mapped pfn under 4GB
@@ -1282,3 +1283,11 @@ static int __init register_kernel_offset
 	return 0;
 }
 __initcall(register_kernel_offset_dumper);
+
+void arch_show_smap(struct seq_file *m, struct vm_area_struct *vma)
+{
+	if (!boot_cpu_has(X86_FEATURE_OSPKE))
+		return;
+
+	seq_printf(m, "ProtectionKey:  %8u\n", vma_pkey(vma));
+}
diff -puN fs/proc/task_mmu.c~pkeys-40-smaps fs/proc/task_mmu.c
--- a/fs/proc/task_mmu.c~pkeys-40-smaps	2015-12-03 16:21:28.285791904 -0800
+++ b/fs/proc/task_mmu.c	2015-12-03 16:21:28.290792131 -0800
@@ -657,6 +657,10 @@ static int smaps_hugetlb_range(pte_t *pt
 }
 #endif /* HUGETLB_PAGE */
 
+void __weak arch_show_smap(struct seq_file *m, struct vm_area_struct *vma)
+{
+}
+
 static int show_smap(struct seq_file *m, void *v, int is_pid)
 {
 	struct vm_area_struct *vma = v;
@@ -713,6 +717,7 @@ static int show_smap(struct seq_file *m,
 		   (vma->vm_flags & VM_LOCKED) ?
 			(unsigned long)(mss.pss >> (10 + PSS_SHIFT)) : 0);
 
+	arch_show_smap(m, vma);
 	show_smap_vma_flags(m, vma);
 	m_cache_vma(m, vma);
 	return 0;
_

^ permalink raw reply	[flat|nested] 145+ messages in thread

* [PATCH 23/34] x86, pkeys: add Kconfig prompt to existing config option
  2015-12-04  1:14 ` Dave Hansen
@ 2015-12-04  1:14   ` Dave Hansen
  -1 siblings, 0 replies; 145+ messages in thread
From: Dave Hansen @ 2015-12-04  1:14 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-mm, x86, Dave Hansen, dave.hansen


From: Dave Hansen <dave.hansen@linux.intel.com>

I don't have a strong opinion on whether we need this or not.
Protection Keys has relatively little code associated with it,
and it is not a heavyweight feature to keep enabled.  However,
I can imagine that folks would still appreciate being able to
disable it.

Here's the option if folks want it.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
---

 b/arch/x86/Kconfig |   10 ++++++++++
 1 file changed, 10 insertions(+)

diff -puN arch/x86/Kconfig~pkeys-40-kconfig-prompt arch/x86/Kconfig
--- a/arch/x86/Kconfig~pkeys-40-kconfig-prompt	2015-12-03 16:21:28.726811905 -0800
+++ b/arch/x86/Kconfig	2015-12-03 16:21:28.730812086 -0800
@@ -1682,8 +1682,18 @@ config X86_INTEL_MPX
 	  If unsure, say N.
 
 config X86_INTEL_MEMORY_PROTECTION_KEYS
+	prompt "Intel Memory Protection Keys"
 	def_bool y
+	# Note: only available in 64-bit mode
 	depends on CPU_SUP_INTEL && X86_64
+	---help---
+	  Memory Protection Keys provides a mechanism for enforcing
+	  page-based protections, but without requiring modification of the
+	  page tables when an application changes protection domains.
+
+	  For details, see Documentation/x86/protection-keys.txt
+
+	  If unsure, say y.
 
 config EFI
 	bool "EFI runtime service support"
_

^ permalink raw reply	[flat|nested] 145+ messages in thread

* [PATCH 24/34] mm, multi-arch: pass a protection key in to calc_vm_prot_bits()
  2015-12-04  1:14 ` Dave Hansen
@ 2015-12-04  1:14   ` Dave Hansen
  -1 siblings, 0 replies; 145+ messages in thread
From: Dave Hansen @ 2015-12-04  1:14 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, x86, Dave Hansen, dave.hansen, linux-api, linux-arch


From: Dave Hansen <dave.hansen@linux.intel.com>

This plumbs a protection key through calc_vm_prot_bits().  We
could have done this in calc_vm_flag_bits(), but I did not feel
super strongly about which way to go; the choice between the
two was fairly arbitrary.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: linux-api@vger.kernel.org
Cc: linux-arch@vger.kernel.org
---

 b/arch/powerpc/include/asm/mman.h  |    5 +++--
 b/drivers/char/agp/frontend.c      |    2 +-
 b/drivers/staging/android/ashmem.c |    4 ++--
 b/include/linux/mman.h             |    6 +++---
 b/mm/mmap.c                        |    2 +-
 b/mm/mprotect.c                    |    2 +-
 b/mm/nommu.c                       |    2 +-
 7 files changed, 12 insertions(+), 11 deletions(-)

diff -puN arch/powerpc/include/asm/mman.h~pkeys-84-calc_vm_prot_bits arch/powerpc/include/asm/mman.h
--- a/arch/powerpc/include/asm/mman.h~pkeys-84-calc_vm_prot_bits	2015-12-03 16:21:29.142830772 -0800
+++ b/arch/powerpc/include/asm/mman.h	2015-12-03 16:21:29.155831361 -0800
@@ -18,11 +18,12 @@
 * This file is included by linux/mman.h, so we can't use calc_vm_prot_bits()
  * here.  How important is the optimization?
  */
-static inline unsigned long arch_calc_vm_prot_bits(unsigned long prot)
+static inline unsigned long arch_calc_vm_prot_bits(unsigned long prot,
+		unsigned long pkey)
 {
 	return (prot & PROT_SAO) ? VM_SAO : 0;
 }
-#define arch_calc_vm_prot_bits(prot) arch_calc_vm_prot_bits(prot)
+#define arch_calc_vm_prot_bits(prot, pkey) arch_calc_vm_prot_bits(prot, pkey)
 
 static inline pgprot_t arch_vm_get_page_prot(unsigned long vm_flags)
 {
diff -puN drivers/char/agp/frontend.c~pkeys-84-calc_vm_prot_bits drivers/char/agp/frontend.c
--- a/drivers/char/agp/frontend.c~pkeys-84-calc_vm_prot_bits	2015-12-03 16:21:29.143830817 -0800
+++ b/drivers/char/agp/frontend.c	2015-12-03 16:21:29.155831361 -0800
@@ -156,7 +156,7 @@ static pgprot_t agp_convert_mmap_flags(i
 {
 	unsigned long prot_bits;
 
-	prot_bits = calc_vm_prot_bits(prot) | VM_SHARED;
+	prot_bits = calc_vm_prot_bits(prot, 0) | VM_SHARED;
 	return vm_get_page_prot(prot_bits);
 }
 
diff -puN drivers/staging/android/ashmem.c~pkeys-84-calc_vm_prot_bits drivers/staging/android/ashmem.c
--- a/drivers/staging/android/ashmem.c~pkeys-84-calc_vm_prot_bits	2015-12-03 16:21:29.145830908 -0800
+++ b/drivers/staging/android/ashmem.c	2015-12-03 16:21:29.156831407 -0800
@@ -372,8 +372,8 @@ static int ashmem_mmap(struct file *file
 	}
 
 	/* requested protection bits must match our allowed protection mask */
-	if (unlikely((vma->vm_flags & ~calc_vm_prot_bits(asma->prot_mask)) &
-		     calc_vm_prot_bits(PROT_MASK))) {
+	if (unlikely((vma->vm_flags & ~calc_vm_prot_bits(asma->prot_mask, 0)) &
+		     calc_vm_prot_bits(PROT_MASK, 0))) {
 		ret = -EPERM;
 		goto out;
 	}
diff -puN include/linux/mman.h~pkeys-84-calc_vm_prot_bits include/linux/mman.h
--- a/include/linux/mman.h~pkeys-84-calc_vm_prot_bits	2015-12-03 16:21:29.147830999 -0800
+++ b/include/linux/mman.h	2015-12-03 16:21:29.156831407 -0800
@@ -35,7 +35,7 @@ static inline void vm_unacct_memory(long
  */
 
 #ifndef arch_calc_vm_prot_bits
-#define arch_calc_vm_prot_bits(prot) 0
+#define arch_calc_vm_prot_bits(prot, pkey) 0
 #endif
 
 #ifndef arch_vm_get_page_prot
@@ -70,12 +70,12 @@ static inline int arch_validate_prot(uns
  * Combine the mmap "prot" argument into "vm_flags" used internally.
  */
 static inline unsigned long
-calc_vm_prot_bits(unsigned long prot)
+calc_vm_prot_bits(unsigned long prot, unsigned long pkey)
 {
 	return _calc_vm_trans(prot, PROT_READ,  VM_READ ) |
 	       _calc_vm_trans(prot, PROT_WRITE, VM_WRITE) |
 	       _calc_vm_trans(prot, PROT_EXEC,  VM_EXEC) |
-	       arch_calc_vm_prot_bits(prot);
+	       arch_calc_vm_prot_bits(prot, pkey);
 }
 
 /*
diff -puN mm/mmap.c~pkeys-84-calc_vm_prot_bits mm/mmap.c
--- a/mm/mmap.c~pkeys-84-calc_vm_prot_bits	2015-12-03 16:21:29.148831044 -0800
+++ b/mm/mmap.c	2015-12-03 16:21:29.157831452 -0800
@@ -1309,7 +1309,7 @@ unsigned long do_mmap(struct file *file,
 	 * to. we assume access permissions have been handled by the open
 	 * of the memory object, so we don't do any here.
 	 */
-	vm_flags |= calc_vm_prot_bits(prot) | calc_vm_flag_bits(flags) |
+	vm_flags |= calc_vm_prot_bits(prot, 0) | calc_vm_flag_bits(flags) |
 			mm->def_flags | VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC;
 
 	if (flags & MAP_LOCKED)
diff -puN mm/mprotect.c~pkeys-84-calc_vm_prot_bits mm/mprotect.c
--- a/mm/mprotect.c~pkeys-84-calc_vm_prot_bits	2015-12-03 16:21:29.150831135 -0800
+++ b/mm/mprotect.c	2015-12-03 16:21:29.158831497 -0800
@@ -373,7 +373,7 @@ SYSCALL_DEFINE3(mprotect, unsigned long,
 	if ((prot & PROT_READ) && (current->personality & READ_IMPLIES_EXEC))
 		prot |= PROT_EXEC;
 
-	vm_flags = calc_vm_prot_bits(prot);
+	vm_flags = calc_vm_prot_bits(prot, 0);
 
 	down_write(&current->mm->mmap_sem);
 
diff -puN mm/nommu.c~pkeys-84-calc_vm_prot_bits mm/nommu.c
--- a/mm/nommu.c~pkeys-84-calc_vm_prot_bits	2015-12-03 16:21:29.152831225 -0800
+++ b/mm/nommu.c	2015-12-03 16:21:29.158831497 -0800
@@ -1090,7 +1090,7 @@ static unsigned long determine_vm_flags(
 {
 	unsigned long vm_flags;
 
-	vm_flags = calc_vm_prot_bits(prot) | calc_vm_flag_bits(flags);
+	vm_flags = calc_vm_prot_bits(prot, 0) | calc_vm_flag_bits(flags);
 	/* vm_flags |= mm->def_flags; */
 
 	if (!(capabilities & NOMMU_MAP_DIRECT)) {
_

^ permalink raw reply	[flat|nested] 145+ messages in thread

* [PATCH 25/34] x86, pkeys: add arch_validate_pkey()
  2015-12-04  1:14 ` Dave Hansen
@ 2015-12-04  1:14   ` Dave Hansen
  -1 siblings, 0 replies; 145+ messages in thread
From: Dave Hansen @ 2015-12-04  1:14 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-mm, x86, Dave Hansen, dave.hansen


From: Dave Hansen <dave.hansen@linux.intel.com>

The syscall-level code is passed a protection key and needs to
return an appropriate error code if the protection key is bogus.
We will be using this in subsequent patches.

Note that this also begins a series of arch-specific calls that
we need to expose in otherwise arch-independent code.  We create
a linux/pkeys.h header where we will put *all* the stubs for
these functions.
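
Here is a userspace-runnable sketch of the validation semantics,
with arch_max_pkey() stubbed out; the 16-key figure for x86 is
the CONFIG_NR_PROTECTION_KEYS default added later in this
series:

	#include <assert.h>
	#include <stdbool.h>

	/* 'ospke' stands in for boot_cpu_has(X86_FEATURE_OSPKE) */
	static int arch_max_pkey(bool ospke)
	{
		return ospke ? 16 : 1;
	}

	static bool arch_validate_pkey(int pkey, bool ospke)
	{
		return pkey >= 0 && pkey < arch_max_pkey(ospke);
	}

	int main(void)
	{
		assert(arch_validate_pkey(15, true));	/* keys 0..15 valid */
		assert(!arch_validate_pkey(16, true));
		assert(!arch_validate_pkey(-1, true));
		assert(!arch_validate_pkey(1, false));	/* only key 0 without OSPKE */
		return 0;
	}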

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
---

 b/arch/x86/Kconfig             |    1 +
 b/arch/x86/include/asm/pkeys.h |   10 ++++++++++
 b/include/linux/pkeys.h        |   22 ++++++++++++++++++++++
 b/mm/Kconfig                   |    2 ++
 4 files changed, 35 insertions(+)

diff -puN /dev/null arch/x86/include/asm/pkeys.h
--- /dev/null	2015-07-13 14:24:11.435656502 -0700
+++ b/arch/x86/include/asm/pkeys.h	2015-12-03 16:21:29.710856533 -0800
@@ -0,0 +1,10 @@
+#ifndef _ASM_X86_PKEYS_H
+#define _ASM_X86_PKEYS_H
+
+#define arch_max_pkey() (boot_cpu_has(X86_FEATURE_OSPKE) ?      \
+				CONFIG_NR_PROTECTION_KEYS : 1)
+#define arch_validate_pkey(pkey) (((pkey) >= 0) && ((pkey) < arch_max_pkey()))
+
+#endif /*_ASM_X86_PKEYS_H */
+
+
diff -puN arch/x86/Kconfig~pkeys-15-arch_validate_peky arch/x86/Kconfig
--- a/arch/x86/Kconfig~pkeys-15-arch_validate_peky	2015-12-03 16:21:29.705856306 -0800
+++ b/arch/x86/Kconfig	2015-12-03 16:21:29.711856578 -0800
@@ -153,6 +153,7 @@ config X86
 	select X86_DEV_DMA_OPS			if X86_64
 	select X86_FEATURE_NAMES		if PROC_FS
 	select ARCH_USES_HIGH_VMA_FLAGS		if X86_INTEL_MEMORY_PROTECTION_KEYS
+	select ARCH_HAS_PKEYS			if X86_INTEL_MEMORY_PROTECTION_KEYS
 
 config INSTRUCTION_DECODER
 	def_bool y
diff -puN /dev/null include/linux/pkeys.h
--- /dev/null	2015-07-13 14:24:11.435656502 -0700
+++ b/include/linux/pkeys.h	2015-12-03 16:21:29.711856578 -0800
@@ -0,0 +1,22 @@
+#ifndef _LINUX_PKEYS_H
+#define _LINUX_PKEYS_H
+
+#include <linux/mm_types.h>
+#include <asm/mmu_context.h>
+
+#ifdef CONFIG_ARCH_HAS_PKEYS
+#include <asm/pkeys.h>
+#else /* ! CONFIG_ARCH_HAS_PKEYS */
+
+/*
+ * This is called from mprotect_pkey().
+ *
+ * Returns true if the protection key is valid.
+ */
+static inline bool arch_validate_pkey(int key)
+{
+	return true;
+}
+#endif /* ! CONFIG_ARCH_HAS_PKEYS */
+
+#endif /* _LINUX_PKEYS_H */
diff -puN mm/Kconfig~pkeys-15-arch_validate_peky mm/Kconfig
--- a/mm/Kconfig~pkeys-15-arch_validate_peky	2015-12-03 16:21:29.707856396 -0800
+++ b/mm/Kconfig	2015-12-03 16:21:29.711856578 -0800
@@ -671,3 +671,5 @@ config FRAME_VECTOR
 
 config ARCH_USES_HIGH_VMA_FLAGS
 	bool
+config ARCH_HAS_PKEYS
+	bool
_

^ permalink raw reply	[flat|nested] 145+ messages in thread

* [PATCH 26/34] mm: implement new mprotect_key() system call
  2015-12-04  1:14 ` Dave Hansen
@ 2015-12-04  1:15   ` Dave Hansen
  -1 siblings, 0 replies; 145+ messages in thread
From: Dave Hansen @ 2015-12-04  1:15 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-mm, x86, Dave Hansen, dave.hansen, linux-api


From: Dave Hansen <dave.hansen@linux.intel.com>

mprotect_key() (implemented below as pkey_mprotect()) is just
like mprotect(), except it also takes a protection key as an
argument.  On systems that do not support protection keys, it
still works, but requires that key=0.  Otherwise it behaves
exactly like mprotect().

I expect it to get used like this, if you want to guarantee that
any mapping you create can *never* be accessed without the right
protection keys set up.

	pkey_deny_access(11); // random pkey
	int real_prot = PROT_READ|PROT_WRITE;
	ptr = mmap(NULL, PAGE_SIZE, PROT_NONE, MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
	ret = mprotect_key(ptr, PAGE_SIZE, real_prot, 11);

This way, there is *no* window where the mapping is accessible
since it was always either PROT_NONE or had a protection key set.

We settled on 'unsigned long' for the type of the key here.  We
only need 4 bits on x86 today, but I figured that other
architectures might need some more space.
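
For completeness, a hedged userspace sketch of the flow from the
example above, using raw syscall().  The syscall number is the
x86_64 one wired up later in this series; the pkey_deny_access()
step is the author's illustrative helper and is omitted here:

	#include <sys/mman.h>
	#include <sys/syscall.h>
	#include <unistd.h>

	#define __NR_pkey_mprotect 326	/* x86_64, from this series */

	/* map 'len' bytes with no window of pkey-less accessibility */
	static void *map_with_pkey(size_t len, int pkey)
	{
		void *ptr = mmap(NULL, len, PROT_NONE,
				 MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
		if (ptr == MAP_FAILED)
			return NULL;
		if (syscall(__NR_pkey_mprotect, ptr, len,
			    PROT_READ | PROT_WRITE, pkey)) {
			munmap(ptr, len);
			return NULL;
		}
		return ptr;
	}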

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: linux-api@vger.kernel.org
---

 b/arch/x86/include/asm/mmu_context.h |   10 +++++++--
 b/include/linux/pkeys.h              |    7 +++++-
 b/mm/Kconfig                         |    7 ++++++
 b/mm/mprotect.c                      |   36 +++++++++++++++++++++++++++++------
 4 files changed, 51 insertions(+), 9 deletions(-)

diff -puN arch/x86/include/asm/mmu_context.h~pkeys-85-mprotect_pkey arch/x86/include/asm/mmu_context.h
--- a/arch/x86/include/asm/mmu_context.h~pkeys-85-mprotect_pkey	2015-12-03 16:21:30.181877894 -0800
+++ b/arch/x86/include/asm/mmu_context.h	2015-12-03 16:21:30.190878302 -0800
@@ -4,6 +4,7 @@
 #include <asm/desc.h>
 #include <linux/atomic.h>
 #include <linux/mm_types.h>
+#include <linux/pkeys.h>
 
 #include <trace/events/tlb.h>
 
@@ -243,10 +244,14 @@ static inline void arch_unmap(struct mm_
 		mpx_notify_unmap(mm, vma, start, end);
 }
 
+#ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS
+/*
+ * If the config option is off, we get the generic version from
+ * include/linux/pkeys.h.
+ */
 static inline int vma_pkey(struct vm_area_struct *vma)
 {
 	u16 pkey = 0;
-#ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS
 	unsigned long vma_pkey_mask = VM_PKEY_BIT0 | VM_PKEY_BIT1 |
 				      VM_PKEY_BIT2 | VM_PKEY_BIT3;
 	/*
@@ -259,9 +264,10 @@ static inline int vma_pkey(struct vm_are
 	 */
 	pkey = (vma->vm_flags >> vm_pkey_shift) &
 	       (vma_pkey_mask >> vm_pkey_shift);
-#endif
+
 	return pkey;
 }
+#endif
 
 static inline bool __pkru_allows_pkey(u16 pkey, bool write)
 {
diff -puN include/linux/pkeys.h~pkeys-85-mprotect_pkey include/linux/pkeys.h
--- a/include/linux/pkeys.h~pkeys-85-mprotect_pkey	2015-12-03 16:21:30.183877985 -0800
+++ b/include/linux/pkeys.h	2015-12-03 16:21:30.190878302 -0800
@@ -2,10 +2,10 @@
 #define _LINUX_PKEYS_H
 
 #include <linux/mm_types.h>
-#include <asm/mmu_context.h>
 
 #ifdef CONFIG_ARCH_HAS_PKEYS
 #include <asm/pkeys.h>
+#include <asm/mmu_context.h>
 #else /* ! CONFIG_ARCH_HAS_PKEYS */
 
 /*
@@ -17,6 +17,11 @@ static inline bool arch_validate_pkey(in
 {
 	return true;
 }
+
+static inline int vma_pkey(struct vm_area_struct *vma)
+{
+	return 0;
+}
 #endif /* ! CONFIG_ARCH_HAS_PKEYS */
 
 #endif /* _LINUX_PKEYS_H */
diff -puN mm/Kconfig~pkeys-85-mprotect_pkey mm/Kconfig
--- a/mm/Kconfig~pkeys-85-mprotect_pkey	2015-12-03 16:21:30.185878075 -0800
+++ b/mm/Kconfig	2015-12-03 16:21:30.190878302 -0800
@@ -673,3 +673,10 @@ config ARCH_USES_HIGH_VMA_FLAGS
 	bool
 config ARCH_HAS_PKEYS
 	bool
+
+config NR_PROTECTION_KEYS
+	int
+	# Everything supports a _single_ key, so allow folks to
+	# at least call APIs that take keys, but require that the
+	# key be 0.
+	default 1
diff -puN mm/mprotect.c~pkeys-85-mprotect_pkey mm/mprotect.c
--- a/mm/mprotect.c~pkeys-85-mprotect_pkey	2015-12-03 16:21:30.186878121 -0800
+++ b/mm/mprotect.c	2015-12-03 16:21:30.191878347 -0800
@@ -24,6 +24,7 @@
 #include <linux/migrate.h>
 #include <linux/perf_event.h>
 #include <linux/ksm.h>
+#include <linux/pkeys.h>
 #include <asm/uaccess.h>
 #include <asm/pgtable.h>
 #include <asm/cacheflush.h>
@@ -344,10 +345,13 @@ fail:
 	return error;
 }
 
-SYSCALL_DEFINE3(mprotect, unsigned long, start, size_t, len,
-		unsigned long, prot)
+/*
+ * pkey=-1 when doing a legacy mprotect()
+ */
+static int do_mprotect_pkey(unsigned long start, size_t len,
+		unsigned long prot, int pkey)
 {
-	unsigned long vm_flags, nstart, end, tmp, reqprot;
+	unsigned long nstart, end, tmp, reqprot;
 	struct vm_area_struct *vma, *prev;
 	int error = -EINVAL;
 	const int grows = prot & (PROT_GROWSDOWN|PROT_GROWSUP);
@@ -373,8 +377,6 @@ SYSCALL_DEFINE3(mprotect, unsigned long,
 	if ((prot & PROT_READ) && (current->personality & READ_IMPLIES_EXEC))
 		prot |= PROT_EXEC;
 
-	vm_flags = calc_vm_prot_bits(prot, 0);
-
 	down_write(&current->mm->mmap_sem);
 
 	vma = find_vma(current->mm, start);
@@ -407,7 +409,14 @@ SYSCALL_DEFINE3(mprotect, unsigned long,
 
 		/* Here we know that vma->vm_start <= nstart < vma->vm_end. */
 
-		newflags = vm_flags;
+		/*
+		 * If this is a vanilla, non-pkey mprotect, inherit the
+		 * pkey from the VMA we are working on.
+		 */
+		if (pkey == -1)
+			newflags = calc_vm_prot_bits(prot, vma_pkey(vma));
+		else
+			newflags = calc_vm_prot_bits(prot, pkey);
 		newflags |= (vma->vm_flags & ~(VM_READ | VM_WRITE | VM_EXEC));
 
 		/* newflags >> 4 shift VM_MAY% in place of VM_% */
@@ -443,3 +452,18 @@ out:
 	up_write(&current->mm->mmap_sem);
 	return error;
 }
+
+SYSCALL_DEFINE3(mprotect, unsigned long, start, size_t, len,
+		unsigned long, prot)
+{
+	return do_mprotect_pkey(start, len, prot, -1);
+}
+
+SYSCALL_DEFINE4(pkey_mprotect, unsigned long, start, size_t, len,
+		unsigned long, prot, int, pkey)
+{
+	if (!arch_validate_pkey(pkey))
+		return -EINVAL;
+
+	return do_mprotect_pkey(start, len, prot, pkey);
+}
_

^ permalink raw reply	[flat|nested] 145+ messages in thread

* [PATCH 27/34] x86, pkeys: make mprotect_key() mask off additional vm_flags
  2015-12-04  1:14 ` Dave Hansen
@ 2015-12-04  1:15   ` Dave Hansen
  -1 siblings, 0 replies; 145+ messages in thread
From: Dave Hansen @ 2015-12-04  1:15 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-mm, x86, Dave Hansen, dave.hansen


From: Dave Hansen <dave.hansen@linux.intel.com>

Today, mprotect() takes 4 bits of data: PROT_READ/WRITE/EXEC/NONE.
Three of those bits: READ/WRITE/EXEC get translated directly in to
vma->vm_flags by calc_vm_prot_bits().  If a bit is unset in
mprotect()'s 'prot' argument then it must be cleared in vma->vm_flags
during the mprotect() call.

We do the by first calculating the VMA flags we want set, then
clearing the ones we do not want to inherit from the original VMA:

	vm_flags = calc_vm_prot_bits(prot, key);
	...
	newflags = vm_flags;
	newflags |= (vma->vm_flags & ~(VM_READ | VM_WRITE | VM_EXEC));

However, we *also* want to mask off the original VMA's vm_flags in
which we store the protection key.

To do that, this patch adds a new macro:

	ARCH_VM_PKEY_FLAGS

which allows the architecture to specify additional bits that it would
like cleared.  We use that to ensure that the VM_PKEY_BIT* bits get
cleared.

This got missed in my testing because I was always going from a pkey=0
VMA to a nonzero one.  The unpatched code happens to work when we only
ever set pkey bits and never clear them.  I've since covered this case
in my testing.
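
Below is a runnable sketch of the bug, with stand-in bit values
rather than the kernel's real VM_* constants.  Moving a VMA from
pkey=3 to pkey=0 only works once the old pkey bits are masked
off:

	#include <assert.h>

	#define VM_READ       0x1UL	/* stand-ins, not kernel values */
	#define VM_WRITE      0x2UL
	#define VM_EXEC       0x4UL
	#define VM_PKEY_SHIFT 4
	#define ARCH_VM_PKEY_FLAGS (0xfUL << VM_PKEY_SHIFT)

	static unsigned long calc_vm_prot_bits(unsigned long prot_bits, int pkey)
	{
		return prot_bits | ((unsigned long)pkey << VM_PKEY_SHIFT);
	}

	int main(void)
	{
		/* existing VMA: readable, currently holding pkey=3 */
		unsigned long vm_flags = VM_READ | (3UL << VM_PKEY_SHIFT);
		unsigned long newflags;

		/* old behavior: stale pkey=3 bits leak back in */
		newflags = calc_vm_prot_bits(VM_READ, 0);
		newflags |= vm_flags & ~(VM_READ | VM_WRITE | VM_EXEC);
		assert(newflags & ARCH_VM_PKEY_FLAGS);	/* still pkey=3 */

		/* new behavior: pkey bits masked off, key 0 takes effect */
		newflags = calc_vm_prot_bits(VM_READ, 0);
		newflags |= vm_flags &
			    ~(VM_READ | VM_WRITE | VM_EXEC | ARCH_VM_PKEY_FLAGS);
		assert(!(newflags & ARCH_VM_PKEY_FLAGS));
		return 0;
	}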

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
---

 b/arch/x86/include/asm/pkeys.h |    2 ++
 b/include/linux/pkeys.h        |    1 +
 b/mm/mprotect.c                |    9 ++++++++-
 3 files changed, 11 insertions(+), 1 deletion(-)

diff -puN arch/x86/include/asm/pkeys.h~pkeys-mask-off-correct-vm_flags arch/x86/include/asm/pkeys.h
--- a/arch/x86/include/asm/pkeys.h~pkeys-mask-off-correct-vm_flags	2015-12-03 16:21:30.666899890 -0800
+++ b/arch/x86/include/asm/pkeys.h	2015-12-03 16:21:30.672900162 -0800
@@ -5,6 +5,8 @@
 				CONFIG_NR_PROTECTION_KEYS : 1)
 #define arch_validate_pkey(pkey) (((pkey) >= 0) && ((pkey) < arch_max_pkey()))
 
+#define ARCH_VM_PKEY_FLAGS (VM_PKEY_BIT0 | VM_PKEY_BIT1 | VM_PKEY_BIT2 | VM_PKEY_BIT3)
+
 #endif /*_ASM_X86_PKEYS_H */
 
 
diff -puN include/linux/pkeys.h~pkeys-mask-off-correct-vm_flags include/linux/pkeys.h
--- a/include/linux/pkeys.h~pkeys-mask-off-correct-vm_flags	2015-12-03 16:21:30.667899935 -0800
+++ b/include/linux/pkeys.h	2015-12-03 16:21:30.672900162 -0800
@@ -7,6 +7,7 @@
 #include <asm/pkeys.h>
 #include <asm/mmu_context.h>
 #else /* ! CONFIG_ARCH_HAS_PKEYS */
+#define ARCH_VM_PKEY_FLAGS 0
 
 /*
  * This is called from mprotect_pkey().
diff -puN mm/mprotect.c~pkeys-mask-off-correct-vm_flags mm/mprotect.c
--- a/mm/mprotect.c~pkeys-mask-off-correct-vm_flags	2015-12-03 16:21:30.669900026 -0800
+++ b/mm/mprotect.c	2015-12-03 16:21:30.673900208 -0800
@@ -406,6 +406,13 @@ static int do_mprotect_pkey(unsigned lon
 
 	for (nstart = start ; ; ) {
 		unsigned long newflags;
+		/*
+		 * Each mprotect() call explicitly passes r/w/x permissions.
+		 * If a permission is not passed to mprotect(), it must be
+		 * cleared from the VMA.
+		 */
+		unsigned long mask_off_old_flags = VM_READ | VM_WRITE | VM_EXEC;
+		mask_off_old_flags |= ARCH_VM_PKEY_FLAGS;
 
 		/* Here we know that vma->vm_start <= nstart < vma->vm_end. */
 
@@ -417,7 +424,7 @@ static int do_mprotect_pkey(unsigned lon
 			newflags = calc_vm_prot_bits(prot, vma_pkey(vma));
 		else
 			newflags = calc_vm_prot_bits(prot, pkey);
-		newflags |= (vma->vm_flags & ~(VM_READ | VM_WRITE | VM_EXEC));
+		newflags |= (vma->vm_flags & ~mask_off_old_flags);
 
 		/* newflags >> 4 shift VM_MAY% in place of VM_% */
 		if ((newflags & ~(newflags >> 4)) & (VM_READ | VM_WRITE | VM_EXEC)) {
_

^ permalink raw reply	[flat|nested] 145+ messages in thread

* [PATCH 28/34] x86: wire up mprotect_key() system call
  2015-12-04  1:14 ` Dave Hansen
@ 2015-12-04  1:15   ` Dave Hansen
  -1 siblings, 0 replies; 145+ messages in thread
From: Dave Hansen @ 2015-12-04  1:15 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-mm, x86, Dave Hansen, dave.hansen, linux-api


From: Dave Hansen <dave.hansen@linux.intel.com>

This is all that we need to get the new system call itself
working on x86.
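
As a worked example of the arch_calc_vm_prot_bits() addition
below (a sketch with stand-in flag values): each bit of the
4-bit key selects one VM_PKEY_BIT*, so key 11 (binary 1011) sets
bits 0, 1 and 3:

	#include <assert.h>

	#define VM_PKEY_BIT0 0x1UL	/* stand-ins, not kernel values */
	#define VM_PKEY_BIT1 0x2UL
	#define VM_PKEY_BIT2 0x4UL
	#define VM_PKEY_BIT3 0x8UL

	static unsigned long pkey_vm_flags(unsigned long key)
	{
		return ((key & 0x1) ? VM_PKEY_BIT0 : 0) |
		       ((key & 0x2) ? VM_PKEY_BIT1 : 0) |
		       ((key & 0x4) ? VM_PKEY_BIT2 : 0) |
		       ((key & 0x8) ? VM_PKEY_BIT3 : 0);
	}

	int main(void)
	{
		assert(pkey_vm_flags(11) ==
		       (VM_PKEY_BIT0 | VM_PKEY_BIT1 | VM_PKEY_BIT3));
		return 0;
	}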

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: linux-api@vger.kernel.org
---

 b/arch/x86/entry/syscalls/syscall_32.tbl |    1 +
 b/arch/x86/entry/syscalls/syscall_64.tbl |    1 +
 b/arch/x86/include/uapi/asm/mman.h       |    6 ++++++
 b/mm/Kconfig                             |    1 +
 4 files changed, 9 insertions(+)

diff -puN arch/x86/entry/syscalls/syscall_32.tbl~pkeys-16-x86-mprotect_key arch/x86/entry/syscalls/syscall_32.tbl
--- a/arch/x86/entry/syscalls/syscall_32.tbl~pkeys-16-x86-mprotect_key	2015-12-03 16:21:31.109919982 -0800
+++ b/arch/x86/entry/syscalls/syscall_32.tbl	2015-12-03 16:21:31.118920390 -0800
@@ -383,3 +383,4 @@
 374	i386	userfaultfd		sys_userfaultfd
 375	i386	membarrier		sys_membarrier
 376	i386	mlock2			sys_mlock2
+377	i386	pkey_mprotect		sys_pkey_mprotect
diff -puN arch/x86/entry/syscalls/syscall_64.tbl~pkeys-16-x86-mprotect_key arch/x86/entry/syscalls/syscall_64.tbl
--- a/arch/x86/entry/syscalls/syscall_64.tbl~pkeys-16-x86-mprotect_key	2015-12-03 16:21:31.111920072 -0800
+++ b/arch/x86/entry/syscalls/syscall_64.tbl	2015-12-03 16:21:31.118920390 -0800
@@ -332,6 +332,7 @@
 323	common	userfaultfd		sys_userfaultfd
 324	common	membarrier		sys_membarrier
 325	common	mlock2			sys_mlock2
+326	common	pkey_mprotect		sys_pkey_mprotect
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
diff -puN arch/x86/include/uapi/asm/mman.h~pkeys-16-x86-mprotect_key arch/x86/include/uapi/asm/mman.h
--- a/arch/x86/include/uapi/asm/mman.h~pkeys-16-x86-mprotect_key	2015-12-03 16:21:31.113920163 -0800
+++ b/arch/x86/include/uapi/asm/mman.h	2015-12-03 16:21:31.118920390 -0800
@@ -20,6 +20,12 @@
 		((vm_flags) & VM_PKEY_BIT1 ? _PAGE_PKEY_BIT1 : 0) |	\
 		((vm_flags) & VM_PKEY_BIT2 ? _PAGE_PKEY_BIT2 : 0) |	\
 		((vm_flags) & VM_PKEY_BIT3 ? _PAGE_PKEY_BIT3 : 0))
+
+#define arch_calc_vm_prot_bits(prot, key) (		\
+		((key) & 0x1 ? VM_PKEY_BIT0 : 0) |      \
+		((key) & 0x2 ? VM_PKEY_BIT1 : 0) |      \
+		((key) & 0x4 ? VM_PKEY_BIT2 : 0) |      \
+		((key) & 0x8 ? VM_PKEY_BIT3 : 0))
 #endif
 
 #include <asm-generic/mman.h>
diff -puN mm/Kconfig~pkeys-16-x86-mprotect_key mm/Kconfig
--- a/mm/Kconfig~pkeys-16-x86-mprotect_key	2015-12-03 16:21:31.114920208 -0800
+++ b/mm/Kconfig	2015-12-03 16:21:31.119920435 -0800
@@ -679,4 +679,5 @@ config NR_PROTECTION_KEYS
 	# Everything supports a _single_ key, so allow folks to
 	# at least call APIs that take keys, but require that the
 	# key be 0.
+	default 16 if X86_INTEL_MEMORY_PROTECTION_KEYS
 	default 1
_

^ permalink raw reply	[flat|nested] 145+ messages in thread

* [PATCH 29/34] x86: separate out LDT init from context init
  2015-12-04  1:14 ` Dave Hansen
@ 2015-12-04  1:15   ` Dave Hansen
  0 siblings, 0 replies; 145+ messages in thread
From: Dave Hansen @ 2015-12-04  1:15 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-mm, x86, Dave Hansen, dave.hansen


From: Dave Hansen <dave.hansen@linux.intel.com>

The arch-specific mm_context_t is a great place to put
protection-key allocation state.

But, we need to initialize the allocation state because pkey 0 is
always "allocated".  All of the runtime initialization of
mm_context_t is done in *_ldt() manipulation functions.  This
renames the existing LDT functions like this:

	init_new_context() -> init_new_context_ldt()
	destroy_context() -> destroy_context_ldt()

and makes init_new_context() and destroy_context() available for
generic use.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
---

 b/arch/x86/include/asm/mmu_context.h |   21 ++++++++++++++++-----
 b/arch/x86/kernel/ldt.c              |    4 ++--
 2 files changed, 18 insertions(+), 7 deletions(-)

diff -puN arch/x86/include/asm/mmu_context.h~init-ldt-extricate arch/x86/include/asm/mmu_context.h
--- a/arch/x86/include/asm/mmu_context.h~init-ldt-extricate	2015-12-03 16:21:31.585941570 -0800
+++ b/arch/x86/include/asm/mmu_context.h	2015-12-03 16:21:31.590941797 -0800
@@ -53,15 +53,15 @@ struct ldt_struct {
 /*
  * Used for LDT copy/destruction.
  */
-int init_new_context(struct task_struct *tsk, struct mm_struct *mm);
-void destroy_context(struct mm_struct *mm);
+int init_new_context_ldt(struct task_struct *tsk, struct mm_struct *mm);
+void destroy_context_ldt(struct mm_struct *mm);
 #else	/* CONFIG_MODIFY_LDT_SYSCALL */
-static inline int init_new_context(struct task_struct *tsk,
-				   struct mm_struct *mm)
+static inline int init_new_context_ldt(struct task_struct *tsk,
+				       struct mm_struct *mm)
 {
 	return 0;
 }
-static inline void destroy_context(struct mm_struct *mm) {}
+static inline void destroy_context_ldt(struct mm_struct *mm) {}
 #endif
 
 static inline void load_mm_ldt(struct mm_struct *mm)
@@ -105,6 +105,17 @@ static inline void enter_lazy_tlb(struct
 #endif
 }
 
+static inline int init_new_context(struct task_struct *tsk,
+				   struct mm_struct *mm)
+{
+	return init_new_context_ldt(tsk, mm);
+}
+static inline void destroy_context(struct mm_struct *mm)
+{
+	destroy_context_ldt(mm);
+}
+
 static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next,
 			     struct task_struct *tsk)
 {
diff -puN arch/x86/kernel/ldt.c~init-ldt-extricate arch/x86/kernel/ldt.c
--- a/arch/x86/kernel/ldt.c~init-ldt-extricate	2015-12-03 16:21:31.587941660 -0800
+++ b/arch/x86/kernel/ldt.c	2015-12-03 16:21:31.590941797 -0800
@@ -103,7 +103,7 @@ static void free_ldt_struct(struct ldt_s
  * we do not have to muck with descriptors here, that is
  * done in switch_mm() as needed.
  */
-int init_new_context(struct task_struct *tsk, struct mm_struct *mm)
+int init_new_context_ldt(struct task_struct *tsk, struct mm_struct *mm)
 {
 	struct ldt_struct *new_ldt;
 	struct mm_struct *old_mm;
@@ -144,7 +144,7 @@ out_unlock:
  *
  * 64bit: Don't touch the LDT register - we're already in the next thread.
  */
-void destroy_context(struct mm_struct *mm)
+void destroy_context_ldt(struct mm_struct *mm)
 {
 	free_ldt_struct(mm->context.ldt);
 	mm->context.ldt = NULL;
_

^ permalink raw reply	[flat|nested] 145+ messages in thread

* [PATCH 30/34] x86, fpu: allow setting of XSAVE state
  2015-12-04  1:14 ` Dave Hansen
@ 2015-12-04  1:15   ` Dave Hansen
  0 siblings, 0 replies; 145+ messages in thread
From: Dave Hansen @ 2015-12-04  1:15 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-mm, x86, Dave Hansen, dave.hansen


From: Dave Hansen <dave.hansen@linux.intel.com>

We want to modify the Protection Key rights inside the kernel, so
we need to change PKRU's contents.  But, if we do a plain
'wrpkru', when we return to userspace we might do an XRSTOR and
wipe out the kernel's 'wrpkru'.  So, we need to go after PKRU in
the xsave buffer.

We do this by:
1. Ensuring that we have the XSAVE registers (fpregs) in the
   kernel FPU buffer (fpstate)
2. Looking up the location of a given state in the buffer
3. Filling in the state
4. Ensuring that the hardware knows that state is present there
   (basically that the 'init optimization' is not in place).
5. Copying the newly-modified state back to the registers if
   necessary.
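
As a rough sketch only: using the helper added below (plus
XFEATURE_MASK_PKRU and 'struct pkru_state' from elsewhere in this
series), a kernel-side caller updating PKRU in the xsave buffer
would look something like:

	struct pkru_state new_pkru_state = { .pkru = 0, .pad = 0 };

	/* steps 1-5 above all happen inside this helper: */
	fpu__xfeature_set_state(XFEATURE_MASK_PKRU, &new_pkru_state,
			sizeof(new_pkru_state));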

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
---

 b/arch/x86/include/asm/fpu/internal.h |    2 
 b/arch/x86/kernel/fpu/core.c          |   63 +++++++++++++++++++++
 b/arch/x86/kernel/fpu/xstate.c        |  100 +++++++++++++++++++++++++++++++++-
 3 files changed, 163 insertions(+), 2 deletions(-)

diff -puN arch/x86/include/asm/fpu/internal.h~pkey-xsave-set arch/x86/include/asm/fpu/internal.h
--- a/arch/x86/include/asm/fpu/internal.h~pkey-xsave-set	2015-12-03 16:21:32.016961117 -0800
+++ b/arch/x86/include/asm/fpu/internal.h	2015-12-03 16:21:32.022961389 -0800
@@ -24,6 +24,8 @@
 extern void fpu__activate_curr(struct fpu *fpu);
 extern void fpu__activate_fpstate_read(struct fpu *fpu);
 extern void fpu__activate_fpstate_write(struct fpu *fpu);
+extern void fpu__current_fpstate_write_begin(void);
+extern void fpu__current_fpstate_write_end(void);
 extern void fpu__save(struct fpu *fpu);
 extern void fpu__restore(struct fpu *fpu);
 extern int  fpu__restore_sig(void __user *buf, int ia32_frame);
diff -puN arch/x86/kernel/fpu/core.c~pkey-xsave-set arch/x86/kernel/fpu/core.c
--- a/arch/x86/kernel/fpu/core.c~pkey-xsave-set	2015-12-03 16:21:32.017961162 -0800
+++ b/arch/x86/kernel/fpu/core.c	2015-12-03 16:21:32.023961434 -0800
@@ -352,6 +352,69 @@ void fpu__activate_fpstate_write(struct
 }
 
 /*
+ * This function must be called before we write the current
+ * task's fpstate.
+ *
+ * This call gets the current FPU register state and moves
+ * it in to the 'fpstate'.  Preemption is disabled so that
+ * no writes to the 'fpstate' can occur from context
+ * switches.
+ *
+ * Must be followed by a fpu__current_fpstate_write_end().
+ */
+void fpu__current_fpstate_write_begin(void)
+{
+	struct fpu *fpu = &current->thread.fpu;
+
+	/*
+	 * Ensure that the context-switching code does not write
+	 * over the fpstate while we are doing our update.
+	 */
+	preempt_disable();
+
+	/*
+	 * Move the fpregs in to the fpu's 'fpstate'.
+	 */
+	fpu__activate_fpstate_read(fpu);
+
+	/*
+	 * The caller is about to write to 'fpu'.  Ensure that no
+	 * CPU thinks that its fpregs match the fpstate.  This
+	 * ensures we will not be lazy and skip an XRSTOR in the
+	 * future.
+	 */
+	fpu->last_cpu = -1;
+}
+
+/*
+ * This function must be paired with fpu__current_fpstate_write_begin()
+ *
+ * This will ensure that the modified fpstate gets placed back in
+ * the fpregs if necessary.
+ *
+ * Note: This function may be called whether or not an _actual_
+ * write to the fpstate occurred.
+ */
+void fpu__current_fpstate_write_end(void)
+{
+	struct fpu *fpu = &current->thread.fpu;
+
+	/*
+	 * 'fpu' now has an updated copy of the state, but the
+	 * registers may still be out of date.  Update them with
+	 * an XRSTOR if they are active.
+	 */
+	if (fpregs_active())
+		copy_kernel_to_fpregs(&fpu->state);
+
+	/*
+	 * Our update is done and the fpregs/fpstate are in sync
+	 * if necessary.  Context switches can happen again.
+	 */
+	preempt_enable();
+}
+
+/*
  * 'fpu__restore()' is called to copy FPU registers from
  * the FPU fpstate to the live hw registers and to activate
  * access to the hardware registers, so that FPU instructions
diff -puN arch/x86/kernel/fpu/xstate.c~pkey-xsave-set arch/x86/kernel/fpu/xstate.c
--- a/arch/x86/kernel/fpu/xstate.c~pkey-xsave-set	2015-12-03 16:21:32.019961253 -0800
+++ b/arch/x86/kernel/fpu/xstate.c	2015-12-03 16:21:32.023961434 -0800
@@ -679,6 +679,19 @@ void fpu__resume_cpu(void)
 }
 
 /*
+ * Given an xstate feature mask, calculate where in the xsave
+ * buffer the state is.  Callers should ensure that the buffer
+ * is valid.
+ *
+ * Note: does not work for compacted buffers.
+ */
+void *__raw_xsave_addr(struct xregs_state *xsave, int xstate_feature_mask)
+{
+	int feature_nr = fls64(xstate_feature_mask) - 1;
+
+	return (void *)xsave + xstate_comp_offsets[feature_nr];
+}
+/*
  * Given the xsave area and a state inside, this function returns the
  * address of the state.
  *
@@ -698,7 +711,6 @@ void fpu__resume_cpu(void)
  */
 void *get_xsave_addr(struct xregs_state *xsave, int xstate_feature)
 {
-	int feature_nr = fls64(xstate_feature) - 1;
 	/*
 	 * Do we even *have* xsave state?
 	 */
@@ -727,7 +739,7 @@ void *get_xsave_addr(struct xregs_state
 	if (!(xsave->header.xfeatures & xstate_feature))
 		return NULL;
 
-	return (void *)xsave + xstate_comp_offsets[feature_nr];
+	return __raw_xsave_addr(xsave, xstate_feature);
 }
 EXPORT_SYMBOL_GPL(get_xsave_addr);
 
@@ -762,3 +774,87 @@ const void *get_xsave_field_ptr(int xsav
 
 	return get_xsave_addr(&fpu->state.xsave, xsave_state);
 }
+
+
+/*
+ * Set xfeatures (aka XSTATE_BV) bit for a feature that we want
+ * to take out of its "init state".  This will ensure that an
+ * XRSTOR actually restores the state.
+ */
+static void fpu__xfeature_set_non_init(struct xregs_state *xsave,
+		int xstate_feature_mask)
+{
+	xsave->header.xfeatures |= xstate_feature_mask;
+}
+
+/*
+ * This function is safe to call whether the FPU is in use or not.
+ *
+ * Note that this only works on the current task.
+ *
+ * Inputs:
+ *	@xstate_feature_mask: the state which is defined in xsave.h
+ *	(e.g. XFEATURE_MASK_FP, XFEATURE_MASK_SSE, etc...)
+ *	@xstate_feature_src: a pointer to a copy of the state that you
+ *	would like written in to the current task's FPU xsave state.
+ *	This pointer must not be located in the current task's xsave
+ *	area.
+ *	@len: the number of bytes of state at 'xstate_feature_src' to
+ *	copy in to the xsave buffer.
+ */
+static void fpu__xfeature_set_state(int xstate_feature_mask,
+		void *xstate_feature_src, size_t len)
+{
+	struct xregs_state *xsave = &current->thread.fpu.state.xsave;
+	struct fpu *fpu = &current->thread.fpu;
+	void *dst;
+
+	if (!boot_cpu_has(X86_FEATURE_XSAVE)) {
+		WARN_ONCE(1, "%s() attempted with no xsave support", __func__);
+		return;
+	}
+
+	/*
+	 * Tell the FPU code that we need the FPU state to be in
+	 * 'fpu' (not in the registers), and that we need it to
+	 * be stable while we write to it.
+	 */
+	fpu__current_fpstate_write_begin();
+
+	/*
+	 * This method *WILL* *NOT* work for compact-format
+	 * buffers.  If the 'xstate_feature_mask' is unset in
+	 * xcomp_bv then we may need to move other feature state
+	 * "up" in the buffer.
+	 */
+	if (xsave->header.xcomp_bv & xstate_feature_mask) {
+		WARN_ON_ONCE(1);
+		goto out;
+	}
+
+	/* find the location in the xsave buffer of the desired state */
+	dst = __raw_xsave_addr(&fpu->state.xsave, xstate_feature_mask);
+
+	/*
+	 * Make sure that the pointer being passed in did not
+	 * come from the xsave buffer itself.
+	 */
+	WARN_ONCE(xstate_feature_src == dst, "set from xsave buffer itself");
+
+	/* put the caller-provided data in the location */
+	memcpy(dst, xstate_feature_src, len);
+
+	/*
+	 * Mark the xfeature so that the CPU knows there is state
+	 * in the buffer now.
+	 */
+	fpu__xfeature_set_non_init(xsave, xstate_feature_mask);
+out:
+	/*
+	 * We are done writing to the 'fpu'.  Reenable preeption
+	 * and (possibly) move the fpstate back in to the fpregs.
+	 */
+	fpu__current_fpstate_write_end();
+}
_

^ permalink raw reply	[flat|nested] 145+ messages in thread

* [PATCH 31/34] x86, pkeys: allocation/free syscalls
  2015-12-04  1:14 ` Dave Hansen
@ 2015-12-04  1:15   ` Dave Hansen
  0 siblings, 0 replies; 145+ messages in thread
From: Dave Hansen @ 2015-12-04  1:15 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-mm, x86, Dave Hansen, dave.hansen, linux-api


From: Dave Hansen <dave.hansen@linux.intel.com>

This patch adds two new system calls:

	int pkey_alloc(unsigned long flags, unsigned long init_access_rights);
	int pkey_free(int pkey);

These establish which protection keys are valid for use by
userspace.  A key which was not obtained by pkey_alloc() may not
be passed to pkey_mprotect().

In addition, the 'init_access_rights' argument to pkey_alloc() specifies
the rights that will be established for the returned pkey.  For instance:

	pkey = pkey_alloc(flags, PKEY_DISABLE_WRITE);

will return with the bits set in PKRU such that writing to 'pkey' is
already denied.  This keeps userspace from needing to have knowledge
about manipulating PKRU.  It is still free to do so if it wishes, but
it is no longer required.

The kernel does _not_ enforce that this interface must be used for
changes to PKRU, even for keys it does not control.
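
Combined with pkey_mprotect(), a hypothetical userspace lifecycle
looks roughly like this (raw x86_64 syscall numbers are from the
tables below; real applications would use libc wrappers once they
exist):

	int pkey = syscall(327, 0, PKEY_DISABLE_WRITE);	/* pkey_alloc */
	if (pkey < 0)
		return;	/* out of keys: errno == ENOSPC */
	syscall(326, addr, len, PROT_READ|PROT_WRITE, pkey); /* pkey_mprotect */
	/* writes through this mapping now fault until rights change */
	syscall(328, pkey);				/* pkey_free */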

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: linux-api@vger.kernel.org
---

 b/arch/x86/entry/syscalls/syscall_32.tbl |    2 
 b/arch/x86/entry/syscalls/syscall_64.tbl |    2 
 b/arch/x86/include/asm/mmu.h             |    7 ++
 b/arch/x86/include/asm/mmu_context.h     |    8 +++
 b/arch/x86/include/asm/pgtable.h         |    5 +-
 b/arch/x86/include/asm/pkeys.h           |   55 ++++++++++++++++++++++
 b/arch/x86/kernel/fpu/xstate.c           |   75 +++++++++++++++++++++++++++++++
 b/include/linux/pkeys.h                  |   23 +++++++++
 b/include/uapi/asm-generic/mman-common.h |    5 ++
 b/mm/mprotect.c                          |   59 +++++++++++++++++++++++-
 10 files changed, 238 insertions(+), 3 deletions(-)

diff -puN arch/x86/entry/syscalls/syscall_32.tbl~pkey-allocation-syscalls arch/x86/entry/syscalls/syscall_32.tbl
--- a/arch/x86/entry/syscalls/syscall_32.tbl~pkey-allocation-syscalls	2015-12-03 16:21:32.484982342 -0800
+++ b/arch/x86/entry/syscalls/syscall_32.tbl	2015-12-03 16:21:32.502983159 -0800
@@ -384,3 +384,5 @@
 375	i386	membarrier		sys_membarrier
 376	i386	mlock2			sys_mlock2
 377	i386	pkey_mprotect		sys_pkey_mprotect
+378	i386	pkey_alloc		sys_pkey_alloc
+379	i386	pkey_free		sys_pkey_free
diff -puN arch/x86/entry/syscalls/syscall_64.tbl~pkey-allocation-syscalls arch/x86/entry/syscalls/syscall_64.tbl
--- a/arch/x86/entry/syscalls/syscall_64.tbl~pkey-allocation-syscalls	2015-12-03 16:21:32.485982388 -0800
+++ b/arch/x86/entry/syscalls/syscall_64.tbl	2015-12-03 16:21:32.502983159 -0800
@@ -333,6 +333,8 @@
 324	common	membarrier		sys_membarrier
 325	common	mlock2			sys_mlock2
 326	common	pkey_mprotect		sys_pkey_mprotect
+327	common	pkey_alloc		sys_pkey_alloc
+328	common	pkey_free		sys_pkey_free
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
diff -puN arch/x86/include/asm/mmu_context.h~pkey-allocation-syscalls arch/x86/include/asm/mmu_context.h
--- a/arch/x86/include/asm/mmu_context.h~pkey-allocation-syscalls	2015-12-03 16:21:32.487982478 -0800
+++ b/arch/x86/include/asm/mmu_context.h	2015-12-03 16:21:32.503983204 -0800
@@ -108,7 +108,12 @@ static inline void enter_lazy_tlb(struct
 static inline int init_new_context(struct task_struct *tsk,
 				   struct mm_struct *mm)
 {
+#ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS
+	/* pkey 0 is the default and always allocated */
+	mm->context.pkey_allocation_map = 0x1;
+#endif
+
 	return init_new_context_ldt(tsk, mm);
 }
 static inline void destroy_context(struct mm_struct *mm)
@@ -333,4 +338,7 @@ static inline bool arch_pte_access_permi
 	return __pkru_allows_pkey(pte_flags_pkey(pte_flags(pte)), write);
 }
 
+extern int arch_set_user_pkey_access(struct task_struct *tsk, int pkey,
+		unsigned long init_val);
+
 #endif /* _ASM_X86_MMU_CONTEXT_H */
diff -puN arch/x86/include/asm/mmu.h~pkey-allocation-syscalls arch/x86/include/asm/mmu.h
--- a/arch/x86/include/asm/mmu.h~pkey-allocation-syscalls	2015-12-03 16:21:32.489982569 -0800
+++ b/arch/x86/include/asm/mmu.h	2015-12-03 16:21:32.503983204 -0800
@@ -22,6 +22,13 @@ typedef struct {
 	void __user *vdso;
 
 	atomic_t perf_rdpmc_allowed;	/* nonzero if rdpmc is allowed */
+#ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS
+	/*
+	 * One bit per protection key says whether userspace can
+	 * use it or not.  Protected by mmap_sem.
+	 */
+	u16 pkey_allocation_map;
+#endif
 } mm_context_t;
 
 #ifdef CONFIG_SMP
diff -puN arch/x86/include/asm/pgtable.h~pkey-allocation-syscalls arch/x86/include/asm/pgtable.h
--- a/arch/x86/include/asm/pgtable.h~pkey-allocation-syscalls	2015-12-03 16:21:32.490982614 -0800
+++ b/arch/x86/include/asm/pgtable.h	2015-12-03 16:21:32.503983204 -0800
@@ -912,16 +912,17 @@ static inline pte_t pte_swp_clear_soft_d
 
 #define PKRU_AD_BIT 0x1
 #define PKRU_WD_BIT 0x2
+#define PKRU_BITS_PER_PKEY 2
 
 static inline bool __pkru_allows_read(u32 pkru, u16 pkey)
 {
-	int pkru_pkey_bits = pkey * 2;
+	int pkru_pkey_bits = pkey * PKRU_BITS_PER_PKEY;
 	return !(pkru & (PKRU_AD_BIT << pkru_pkey_bits));
 }
 
 static inline bool __pkru_allows_write(u32 pkru, u16 pkey)
 {
-	int pkru_pkey_bits = pkey * 2;
+	int pkru_pkey_bits = pkey * PKRU_BITS_PER_PKEY;
 	/*
 	 * Access-disable disables writes too so we need to check
 	 * both bits here.
diff -puN arch/x86/include/asm/pkeys.h~pkey-allocation-syscalls arch/x86/include/asm/pkeys.h
--- a/arch/x86/include/asm/pkeys.h~pkey-allocation-syscalls	2015-12-03 16:21:32.492982705 -0800
+++ b/arch/x86/include/asm/pkeys.h	2015-12-03 16:21:32.504983249 -0800
@@ -7,6 +7,61 @@
 
 #define ARCH_VM_PKEY_FLAGS (VM_PKEY_BIT0 | VM_PKEY_BIT1 | VM_PKEY_BIT2 | VM_PKEY_BIT3)
 
+#define mm_pkey_allocation_map(mm)	(mm->context.pkey_allocation_map)
+#define mm_set_pkey_allocated(mm, pkey) do {		\
+	mm_pkey_allocation_map(mm) |= (1 << pkey);	\
+} while (0)
+#define mm_set_pkey_free(mm, pkey) do {			\
+	mm_pkey_allocation_map(mm) &= ~(1 << pkey);	\
+} while (0)
+
+static inline
+bool mm_pkey_is_allocated(struct mm_struct *mm, unsigned long pkey)
+{
+	if (!arch_validate_pkey(pkey))
+		return true;
+
+	return mm_pkey_allocation_map(mm) & (1 << pkey);
+}
+
+static inline
+int mm_pkey_alloc(struct mm_struct *mm)
+{
+	int all_pkeys_mask = ((1 << arch_max_pkey()) - 1);
+	int ret;
+
+	/*
+	 * Are we out of pkeys?  We must handle this specially
+	 * because ffz() behavior is undefined if there are no
+	 * zeros.
+	 */
+	if (mm_pkey_allocation_map(mm) == all_pkeys_mask)
+		return -1;
+
+	ret = ffz(mm_pkey_allocation_map(mm));
+
+	mm_set_pkey_allocated(mm, ret);
+
+	return ret;
+}
+
+static inline
+int mm_pkey_free(struct mm_struct *mm, int pkey)
+{
+	/*
+	 * pkey 0 is special, always allocated and can never
+	 * be freed.
+	 */
+	if (!pkey || !arch_validate_pkey(pkey))
+		return -EINVAL;
+	if (!mm_pkey_is_allocated(mm, pkey))
+		return -EINVAL;
+
+	mm_set_pkey_free(mm, pkey);
+
+	return 0;
+}
+
 #endif /*_ASM_X86_PKEYS_H */
 
 
diff -puN arch/x86/kernel/fpu/xstate.c~pkey-allocation-syscalls arch/x86/kernel/fpu/xstate.c
--- a/arch/x86/kernel/fpu/xstate.c~pkey-allocation-syscalls	2015-12-03 16:21:32.494982796 -0800
+++ b/arch/x86/kernel/fpu/xstate.c	2015-12-03 16:21:32.504983249 -0800
@@ -5,6 +5,8 @@
  */
 #include <linux/compat.h>
 #include <linux/cpu.h>
+#include <linux/mman.h>
+#include <linux/pkeys.h>
 
 #include <asm/fpu/api.h>
 #include <asm/fpu/internal.h>
@@ -775,6 +777,7 @@ const void *get_xsave_field_ptr(int xsav
 	return get_xsave_addr(&fpu->state.xsave, xsave_state);
 }
 
+#ifdef CONFIG_ARCH_HAS_PKEYS
 
 /*
  * Set xfeatures (aka XSTATE_BV) bit for a feature that we want
@@ -855,6 +858,78 @@ out:
 	 * and (possibly) move the fpstate back in to the fpregs.
 	 */
 	fpu__current_fpstate_write_end();
+}
+
+#define NR_VALID_PKRU_BITS (CONFIG_NR_PROTECTION_KEYS * 2)
+#define PKRU_VALID_MASK (NR_VALID_PKRU_BITS - 1)
+
+/*
+ * This will go out and modify the XSAVE buffer so that PKRU is
+ * set to a particular state for access to 'pkey'.
+ *
+ * PKRU state does affect kernel access to user memory.  We do
+ * not modify PKRU *itself* here, only the XSAVE state that will
+ * be restored in to PKRU when we return back to userspace.
+ */
+int arch_set_user_pkey_access(struct task_struct *tsk, int pkey,
+		unsigned long init_val)
+{
+	struct xregs_state *xsave = &tsk->thread.fpu.state.xsave;
+	struct pkru_state *old_pkru_state;
+	struct pkru_state new_pkru_state;
+	int pkey_shift = (pkey * PKRU_BITS_PER_PKEY);
+	u32 new_pkru_bits = 0;
+
+	if (!arch_validate_pkey(pkey))
+		return -EINVAL;
+	/*
+	 * This check implies XSAVE support.  OSPKE only gets
+	 * set if we enable XSAVE and we enable PKU in XCR0.
+	 */
+	if (!boot_cpu_has(X86_FEATURE_OSPKE))
+		return -EINVAL;
+
+	/* Set the bits we need in PKRU  */
+	if (init_val & PKEY_DISABLE_ACCESS)
+		new_pkru_bits |= PKRU_AD_BIT;
+	if (init_val & PKEY_DISABLE_WRITE)
+		new_pkru_bits |= PKRU_WD_BIT;
+
+	/* Shift the bits in to the correct place in PKRU for pkey. */
+	new_pkru_bits <<= pkey_shift;
+
+	/* Locate old copy of the state in the xsave buffer */
+	old_pkru_state = get_xsave_addr(xsave, XFEATURE_MASK_PKRU);
+
+	/*
+	 * When state is not in the buffer, it is in the init
+	 * state, set it manually.  Otherwise, copy out the old
+	 * state.
+	 */
+	if (!old_pkru_state)
+		new_pkru_state.pkru = 0;
+	else
+		new_pkru_state.pkru = old_pkru_state->pkru;
+
+	/* mask off any old bits in place */
+	new_pkru_state.pkru &= ~((PKRU_AD_BIT|PKRU_WD_BIT) << pkey_shift);
+	/* Set the newly-requested bits */
+	new_pkru_state.pkru |= new_pkru_bits;
+
+	/*
+	 * We could theoretically live without zeroing pkru.pad.
+	 * The current XSAVE feature state definition says that
+	 * only bytes 0->3 are used.  But we do not want to
+	 * chance leaking kernel stack out to userspace in case a
+	 * memcpy() of the whole xsave buffer was done.
+	 *
+	 * They're in the same cacheline anyway.
+	 */
+	new_pkru_state.pad = 0;
+
+	fpu__xfeature_set_state(XFEATURE_MASK_PKRU, &new_pkru_state,
+			sizeof(new_pkru_state));
 
 	return 0;
 }
+#endif /* CONFIG_ARCH_HAS_PKEYS */
diff -puN include/linux/pkeys.h~pkey-allocation-syscalls include/linux/pkeys.h
--- a/include/linux/pkeys.h~pkey-allocation-syscalls	2015-12-03 16:21:32.495982841 -0800
+++ b/include/linux/pkeys.h	2015-12-03 16:21:32.504983249 -0800
@@ -23,6 +23,29 @@ static inline int vma_pkey(struct vm_are
 {
 	return 0;
 }
+
+static inline bool mm_pkey_is_allocated(struct mm_struct *mm, int pkey)
+{
+	return (pkey == 0);
+}
+
+static inline int mm_pkey_alloc(struct mm_struct *mm)
+{
+	return -1;
+}
+
+static inline int mm_pkey_free(struct mm_struct *mm, int pkey)
+{
+	WARN_ONCE(1, "free of protection key when disabled");
+	return -EINVAL;
+}
+
+static inline int arch_set_user_pkey_access(struct task_struct *tsk, int pkey,
+			unsigned long init_val)
+{
+	return 0;
+}
+
 #endif /* ! CONFIG_ARCH_HAS_PKEYS */
 
 #endif /* _LINUX_PKEYS_H */
diff -puN include/uapi/asm-generic/mman-common.h~pkey-allocation-syscalls include/uapi/asm-generic/mman-common.h
--- a/include/uapi/asm-generic/mman-common.h~pkey-allocation-syscalls	2015-12-03 16:21:32.497982932 -0800
+++ b/include/uapi/asm-generic/mman-common.h	2015-12-03 16:21:32.505983295 -0800
@@ -71,4 +71,9 @@
 #define MAP_HUGE_SHIFT	26
 #define MAP_HUGE_MASK	0x3f
 
+#define PKEY_DISABLE_ACCESS	0x1
+#define PKEY_DISABLE_WRITE	0x2
+#define PKEY_ACCESS_MASK	(PKEY_DISABLE_ACCESS |\
+				 PKEY_DISABLE_WRITE)
+
 #endif /* __ASM_GENERIC_MMAN_COMMON_H */
diff -puN mm/mprotect.c~pkey-allocation-syscalls mm/mprotect.c
--- a/mm/mprotect.c~pkey-allocation-syscalls	2015-12-03 16:21:32.498982977 -0800
+++ b/mm/mprotect.c	2015-12-03 16:21:32.505983295 -0800
@@ -23,11 +23,13 @@
 #include <linux/mmu_notifier.h>
 #include <linux/migrate.h>
 #include <linux/perf_event.h>
 #include <linux/ksm.h>
 #include <linux/pkeys.h>
 #include <asm/uaccess.h>
 #include <asm/pgtable.h>
 #include <asm/cacheflush.h>
+#include <asm/mmu_context.h>
 #include <asm/tlbflush.h>
 
 #include "internal.h"
@@ -355,6 +357,8 @@ static int do_mprotect_pkey(unsigned lon
 	struct vm_area_struct *vma, *prev;
 	int error = -EINVAL;
 	const int grows = prot & (PROT_GROWSDOWN|PROT_GROWSUP);
+	int plain_mprotect = (pkey == -1);
+
 	prot &= ~(PROT_GROWSDOWN|PROT_GROWSUP);
 	if (grows == (PROT_GROWSDOWN|PROT_GROWSUP)) /* can't be both */
 		return -EINVAL;
@@ -379,6 +383,14 @@ static int do_mprotect_pkey(unsigned lon
 
 	down_write(&current->mm->mmap_sem);
 
+	/*
+	 * If userspace did not allocate the pkey, do not let
+	 * them use it here.
+	 */
+	error = -EINVAL;
+	if (!plain_mprotect && !mm_pkey_is_allocated(current->mm, pkey))
+		goto out;
+
 	vma = find_vma(current->mm, start);
 	error = -ENOMEM;
 	if (!vma)
@@ -420,7 +432,7 @@ static int do_mprotect_pkey(unsigned lon
 		 * If this is a vanilla, non-pkey mprotect, inherit the
 		 * pkey from the VMA we are working on.
 		 */
-		if (pkey == -1)
+		if (plain_mprotect)
 			newflags = calc_vm_prot_bits(prot, vma_pkey(vma));
 		else
 			newflags = calc_vm_prot_bits(prot, pkey);
@@ -474,3 +486,48 @@ SYSCALL_DEFINE4(pkey_mprotect, unsigned
 
 	return do_mprotect_pkey(start, len, prot, pkey);
 }
+
+SYSCALL_DEFINE2(pkey_alloc, unsigned long, flags, unsigned long, init_val)
+{
+	int pkey;
+	int ret;
+
+	/* No flags supported yet. */
+	if (flags)
+		return -EINVAL;
+	/* check for unsupported init values */
+	if (init_val & ~PKEY_ACCESS_MASK)
+		return -EINVAL;
+
+	down_write(&current->mm->mmap_sem);
+	pkey = mm_pkey_alloc(current->mm);
+
+	ret = -ENOSPC;
+	if (pkey == -1)
+		goto out;
+
+	ret = arch_set_user_pkey_access(current, pkey, init_val);
+	if (ret) {
+		mm_pkey_free(current->mm, pkey);
+		goto out;
+	}
+	ret = pkey;
+out:
+	up_write(&current->mm->mmap_sem);
+	return ret;
+}
+
+SYSCALL_DEFINE1(pkey_free, int, pkey)
+{
+	int ret;
+
+	down_write(&current->mm->mmap_sem);
+	ret = mm_pkey_free(current->mm, pkey);
+	up_write(&current->mm->mmap_sem);
+
+	/*
+	 * We could provide warnings or errors if any VMA still
+	 * has the pkey set here.
+	 */
+	return ret;
+}
_

^ permalink raw reply	[flat|nested] 145+ messages in thread

* [PATCH 32/34] x86, pkeys: add pkey set/get syscalls
  2015-12-04  1:14 ` Dave Hansen
@ 2015-12-04  1:15   ` Dave Hansen
  0 siblings, 0 replies; 145+ messages in thread
From: Dave Hansen @ 2015-12-04  1:15 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-mm, x86, Dave Hansen, dave.hansen, linux-api


From: Dave Hansen <dave.hansen@linux.intel.com>

This establishes two more system calls for protection key management:

	unsigned long pkey_get(int pkey);
	int pkey_set(int pkey, unsigned long access_rights);

The return value from pkey_get() and the 'access_rights' passed
to pkey_set() are the same format: a bitmask containing
PKEY_DISABLE_WRITE and/or PKEY_DISABLE_ACCESS, or nothing set at all.

These replace userspace's direct use of rdpkru/wrpkru.

With current hardware, the kernel can not enforce that it has
control over a given key.  But, this at least allows the kernel
to indicate to userspace that userspace does not control a given
protection key.

The kernel does _not_ enforce that this interface must be used for
changes to PKRU, even for keys it has not "allocated".
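
A hypothetical sketch of the intended use, again with the raw
x86_64 syscall numbers from the table below (329/330), temporarily
opening up access to memory covered by an allocated pkey:

	unsigned long old_rights = syscall(329, pkey);	/* pkey_get */
	syscall(330, pkey, 0);		/* pkey_set: nothing denied */
	/* ... touch the pkey-protected memory ... */
	syscall(330, pkey, old_rights);	/* restore previous rights */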

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: linux-api@vger.kernel.org
---

 b/arch/x86/entry/syscalls/syscall_32.tbl |    2 +
 b/arch/x86/entry/syscalls/syscall_64.tbl |    2 +
 b/arch/x86/include/asm/mmu_context.h     |    2 +
 b/arch/x86/include/asm/pkeys.h           |    2 -
 b/arch/x86/kernel/fpu/xstate.c           |   55 +++++++++++++++++++++++++++++--
 b/include/linux/pkeys.h                  |    8 ++++
 b/mm/mprotect.c                          |   34 +++++++++++++++++++
 7 files changed, 102 insertions(+), 3 deletions(-)

diff -puN arch/x86/entry/syscalls/syscall_32.tbl~pkey-syscalls-set-get arch/x86/entry/syscalls/syscall_32.tbl
--- a/arch/x86/entry/syscalls/syscall_32.tbl~pkey-syscalls-set-get	2015-12-03 16:21:33.139012003 -0800
+++ b/arch/x86/entry/syscalls/syscall_32.tbl	2015-12-03 16:21:33.151012548 -0800
@@ -386,3 +386,5 @@
 377	i386	pkey_mprotect		sys_pkey_mprotect
 378	i386	pkey_alloc		sys_pkey_alloc
 379	i386	pkey_free		sys_pkey_free
+380	i386	pkey_get		sys_pkey_get
+381	i386	pkey_set		sys_pkey_set
diff -puN arch/x86/entry/syscalls/syscall_64.tbl~pkey-syscalls-set-get arch/x86/entry/syscalls/syscall_64.tbl
--- a/arch/x86/entry/syscalls/syscall_64.tbl~pkey-syscalls-set-get	2015-12-03 16:21:33.141012094 -0800
+++ b/arch/x86/entry/syscalls/syscall_64.tbl	2015-12-03 16:21:33.152012593 -0800
@@ -335,6 +335,8 @@
 326	common	pkey_mprotect		sys_pkey_mprotect
 327	common	pkey_alloc		sys_pkey_alloc
 328	common	pkey_free		sys_pkey_free
+329	common	pkey_get		sys_pkey_get
+330	common	pkey_set		sys_pkey_set
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
diff -puN arch/x86/include/asm/mmu_context.h~pkey-syscalls-set-get arch/x86/include/asm/mmu_context.h
--- a/arch/x86/include/asm/mmu_context.h~pkey-syscalls-set-get	2015-12-03 16:21:33.142012139 -0800
+++ b/arch/x86/include/asm/mmu_context.h	2015-12-03 16:21:33.152012593 -0800
@@ -340,5 +340,7 @@ static inline bool arch_pte_access_permi
 
 extern int arch_set_user_pkey_access(struct task_struct *tsk, int pkey,
 		unsigned long init_val);
+extern unsigned long arch_get_user_pkey_access(struct task_struct *tsk,
+		int pkey);
 
 #endif /* _ASM_X86_MMU_CONTEXT_H */
diff -puN arch/x86/include/asm/pkeys.h~pkey-syscalls-set-get arch/x86/include/asm/pkeys.h
--- a/arch/x86/include/asm/pkeys.h~pkey-syscalls-set-get	2015-12-03 16:21:33.144012230 -0800
+++ b/arch/x86/include/asm/pkeys.h	2015-12-03 16:21:33.152012593 -0800
@@ -16,7 +16,7 @@
 } while (0)
 
 static inline
-bool mm_pkey_is_allocated(struct mm_struct *mm, unsigned long pkey)
+bool mm_pkey_is_allocated(struct mm_struct *mm, int pkey)
 {
 	if (!arch_validate_pkey(pkey))
 		return true;
diff -puN arch/x86/kernel/fpu/xstate.c~pkey-syscalls-set-get arch/x86/kernel/fpu/xstate.c
--- a/arch/x86/kernel/fpu/xstate.c~pkey-syscalls-set-get	2015-12-03 16:21:33.145012275 -0800
+++ b/arch/x86/kernel/fpu/xstate.c	2015-12-03 16:21:33.153012638 -0800
@@ -687,7 +687,7 @@ void fpu__resume_cpu(void)
  *
  * Note: does not work for compacted buffers.
  */
-void *__raw_xsave_addr(struct xregs_state *xsave, int xstate_feature_mask)
+static void *__raw_xsave_addr(struct xregs_state *xsave, int xstate_feature_mask)
 {
 	int feature_nr = fls64(xstate_feature_mask) - 1;
 
@@ -862,6 +862,7 @@ out:
 
 #define NR_VALID_PKRU_BITS (CONFIG_NR_PROTECTION_KEYS * 2)
 #define PKRU_VALID_MASK (NR_VALID_PKRU_BITS - 1)
+#define PKRU_INIT_STATE	0
 
 /*
  * This will go out and modify the XSAVE buffer so that PKRU is
@@ -880,6 +881,9 @@ int arch_set_user_pkey_access(struct tas
 	int pkey_shift = (pkey * PKRU_BITS_PER_PKEY);
 	u32 new_pkru_bits = 0;
 
+	/* Only support manipulating current task for now */
+	if (tsk != current)
+		return -EINVAL;
 	if (!arch_validate_pkey(pkey))
 		return -EINVAL;
 	/*
@@ -907,7 +911,7 @@ int arch_set_user_pkey_access(struct tas
 	 * state.
 	 */
 	if (!old_pkru_state)
-		new_pkru_state.pkru = 0;
+		new_pkru_state.pkru = PKRU_INIT_STATE;
 	else
 		new_pkru_state.pkru = old_pkru_state->pkru;
 
@@ -932,4 +936,51 @@ int arch_set_user_pkey_access(struct tas
 
 	return 0;
 }
+
+/*
+ * Figures out what the rights are currently for 'pkey'.
+ * Converts from PKRU's format to the user-visible PKEY_DISABLE_*
+ * format.
+ */
+unsigned long arch_get_user_pkey_access(struct task_struct *tsk, int pkey)
+{
+	struct fpu *fpu = &current->thread.fpu;
+	u32 pkru_reg;
+	int ret = 0;
+
+	/* Only support manipulating current task for now */
+	if (tsk != current)
+		return -1;
+	if (!boot_cpu_has(X86_FEATURE_OSPKE))
+		return -1;
+	/*
+	 * If the task's FPU registers are not live, the contents
+	 * of the PKRU register itself are invalid.  Consult the
+	 * task's XSAVE buffer for PKRU contents instead.  This is
+	 * much more expensive than reading PKRU directly, but
+	 * should be rare or impossible with eagerfpu mode.
+	 */
+	if (!fpu->fpregs_active) {
+		struct xregs_state *xsave = &fpu->state.xsave;
+		struct pkru_state *pkru_state =
+			get_xsave_addr(xsave, XFEATURE_MASK_PKRU);
+		/*
+		 * If PKRU is in its init state, it is not present
+		 * in the buffer in a saved form.
+		 */
+		if (!pkru_state)
+			pkru_reg = PKRU_INIT_STATE;
+		else
+			pkru_reg = pkru_state->pkru;
+	} else {
+		/*
+		 * The registers are live; consult PKRU directly.
+		 */
+		pkru_reg = read_pkru();
+	}
+	/*
+	 * Convert from PKRU's two-bits-per-key format to the
+	 * user-visible PKEY_DISABLE_* bits.
+	 */
+	if (!__pkru_allows_read(pkru_reg, pkey))
+		ret |= PKEY_DISABLE_ACCESS;
+	if (!__pkru_allows_write(pkru_reg, pkey))
+		ret |= PKEY_DISABLE_WRITE;
+
+	return ret;
+}
 #endif /* CONFIG_ARCH_HAS_PKEYS */
diff -puN include/linux/pkeys.h~pkey-syscalls-set-get include/linux/pkeys.h
--- a/include/linux/pkeys.h~pkey-syscalls-set-get	2015-12-03 16:21:33.147012366 -0800
+++ b/include/linux/pkeys.h	2015-12-03 16:21:33.153012638 -0800
@@ -43,6 +43,14 @@ static inline int mm_pkey_free(struct mm
 static inline int arch_set_user_pkey_access(struct task_struct *tsk, int pkey,
 			unsigned long init_val)
 {
+	return -EINVAL;
+}
+
+static inline
+unsigned long arch_get_user_pkey_access(struct task_struct *tsk, int pkey)
+{
+	if (pkey)
+		return -1;
 	return 0;
 }
 
diff -puN mm/mprotect.c~pkey-syscalls-set-get mm/mprotect.c
--- a/mm/mprotect.c~pkey-syscalls-set-get	2015-12-03 16:21:33.148012412 -0800
+++ b/mm/mprotect.c	2015-12-03 16:21:33.154012684 -0800
@@ -531,3 +531,37 @@ SYSCALL_DEFINE1(pkey_free, int, pkey)
 	 */
 	return ret;
 }
+
+SYSCALL_DEFINE1(pkey_get, int, pkey)
+{
+	unsigned long ret = 0;
+
+	down_write(&current->mm->mmap_sem);
+	if (!mm_pkey_is_allocated(current->mm, pkey))
+		ret = -EBADF;
+	up_write(&current->mm->mmap_sem);
+
+	if (ret)
+		return ret;
+
+	ret = arch_get_user_pkey_access(current, pkey);
+
+	return ret;
+}
+
+SYSCALL_DEFINE2(pkey_set, int, pkey, unsigned long, access_rights)
+{
+	unsigned long ret = 0;
+
+	down_write(&current->mm->mmap_sem);
+	if (!mm_pkey_is_allocated(current->mm, pkey))
+		ret = -EBADF;
+	up_write(&current->mm->mmap_sem);
+
+	if (ret)
+		return ret;
+
+	ret = arch_set_user_pkey_access(current, pkey, access_rights);
+
+	return ret;
+}
_

^ permalink raw reply	[flat|nested] 145+ messages in thread

* [PATCH 33/34] x86, pkeys: actually enable Memory Protection Keys in CPU
  2015-12-04  1:14 ` Dave Hansen
@ 2015-12-04  1:15   ` Dave Hansen
  -1 siblings, 0 replies; 145+ messages in thread
From: Dave Hansen @ 2015-12-04  1:15 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-mm, x86, Dave Hansen, dave.hansen


From: Dave Hansen <dave.hansen@linux.intel.com>

This sets the X86_CR4_PKE bit in CR4 to actually enable the
protection keys feature.  We also include a boot-time disable
for the feature, the "nopku" kernel parameter.

Setting X86_CR4_PKE will cause the X86_FEATURE_OSPKE cpuid
bit to appear set.  At this point in boot, identify_cpu()
has already run the actual CPUID instructions and populated
the "cpu features" structures.  We need to go back and
re-run get_cpu_cap() to make sure it picks up the updated
values.

We *could* simply re-populate the 11th word of the cpuid
data, but this is probably quick enough.

Also note that with the cpu_has() check and X86_FEATURE_PKU
present in disabled-features.h, we do not need an #ifdef
for setup_pku().
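
For reference, the end result is visible to userspace via CPUID
leaf 7, subleaf 0: ECX bit 3 is PKU, ECX bit 4 is OSPKE.  A small
sketch using gcc's <cpuid.h> (not part of this patch):

	#include <cpuid.h>

	/* Returns nonzero once the kernel has set CR4.PKE (OSPKE). */
	static int ospke_enabled(void)
	{
		unsigned int eax, ebx, ecx, edx;

		__cpuid_count(7, 0, eax, ebx, ecx, edx);
		return !!(ecx & (1U << 4));	/* CPUID.7.0:ECX[4] */
	}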

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
---

 b/Documentation/kernel-parameters.txt |    3 ++
 b/arch/x86/kernel/cpu/common.c        |   41 ++++++++++++++++++++++++++++++++++
 2 files changed, 44 insertions(+)

diff -puN arch/x86/kernel/cpu/common.c~pkeys-50-should-be-last-patch arch/x86/kernel/cpu/common.c
--- a/arch/x86/kernel/cpu/common.c~pkeys-50-should-be-last-patch	2015-12-03 16:21:33.708037809 -0800
+++ b/arch/x86/kernel/cpu/common.c	2015-12-03 16:21:33.714038081 -0800
@@ -289,6 +289,46 @@ static __always_inline void setup_smap(s
 }
 
 /*
+ * Protection Keys are not available in 32-bit mode.
+ */
+static bool pku_disabled = false;
+static __always_inline void setup_pku(struct cpuinfo_x86 *c)
+{
+	if (!cpu_has(c, X86_FEATURE_PKU))
+		return;
+	if (pku_disabled)
+		return;
+
+	cr4_set_bits(X86_CR4_PKE);
+	/*
+	 * Setting X86_CR4_PKE will cause the X86_FEATURE_OSPKE
+	 * cpuid bit to be set.  We need to ensure that we
+	 * update that bit in this CPU's "cpu_info".
+	 */
+	get_cpu_cap(c);
+}
+#ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS
+static __init int setup_disable_pku(char *arg)
+{
+	/*
+	 * Do not clear the X86_FEATURE_PKU bit.  All of the
+	 * runtime checks are against OSPKE so clearing the
+	 * bit does nothing.
+	 *
+	 * This way, we will see "pku" in cpuinfo, but not
+	 * "ospke", which is exactly what we want.  It shows
+	 * that the CPU has PKU, but the OS has not enabled it.
+	 * This happens to be exactly how a system would look
+	 * if we disabled the config option.
+	 */
+	pr_info("x86: 'nopku' specified, disabling Memory Protection Keys\n");
+	pku_disabled = true;
+	return 1;
+}
+__setup("nopku", setup_disable_pku);
+#endif /* CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS */
+
+/*
  * Some CPU features depend on higher CPUID levels, which may not always
  * be available due to CPUID level capping or broken virtualization
  * software.  Add those features to this table to auto-disable them.
@@ -948,6 +988,7 @@ static void identify_cpu(struct cpuinfo_
 	init_hypervisor(c);
 	x86_init_rdrand(c);
 	x86_init_cache_qos(c);
+	setup_pku(c);
 
 	/*
 	 * Clear/Set all flags overriden by options, need do it
diff -puN Documentation/kernel-parameters.txt~pkeys-50-should-be-last-patch Documentation/kernel-parameters.txt
--- a/Documentation/kernel-parameters.txt~pkeys-50-should-be-last-patch	2015-12-03 16:21:33.710037900 -0800
+++ b/Documentation/kernel-parameters.txt	2015-12-03 16:21:33.715038127 -0800
@@ -958,6 +958,9 @@ bytes respectively. Such letter suffixes
 			See Documentation/x86/intel_mpx.txt for more
 			information about the feature.
 
+	nopku		[X86] Disable Memory Protection Keys CPU feature found
+			in some Intel CPUs.
+
 	eagerfpu=	[X86]
 			on	enable eager fpu restore
 			off	disable eager fpu restore
_

^ permalink raw reply	[flat|nested] 145+ messages in thread

* [PATCH 34/34] x86, pkeys: Documentation
  2015-12-04  1:14 ` Dave Hansen
@ 2015-12-04  1:15   ` Dave Hansen
  -1 siblings, 0 replies; 145+ messages in thread
From: Dave Hansen @ 2015-12-04  1:15 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-mm, x86, Dave Hansen, dave.hansen


From: Dave Hansen <dave.hansen@linux.intel.com>


Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
---

 b/Documentation/x86/protection-keys.txt |   53 ++++++++++++++++++++++++++++++++
 1 file changed, 53 insertions(+)

diff -puN /dev/null Documentation/x86/protection-keys.txt
--- /dev/null	2015-07-13 14:24:11.435656502 -0700
+++ b/Documentation/x86/protection-keys.txt	2015-12-03 16:22:15.486932540 -0800
@@ -0,0 +1,53 @@
+Memory Protection Keys for Userspace (PKU aka PKEYs) is a CPU feature
+which will be found on future Intel CPUs.
+
+Memory Protection Keys provides a mechanism for enforcing page-based
+protections, but without requiring modification of the page tables
+when an application changes protection domains.  It works by
+dedicating 4 previously ignored bits in each page table entry to a
+"protection key", giving 16 possible keys.
+
+There is also a new user-accessible register (PKRU) with two separate
+bits (Access Disable and Write Disable) for each key.  Being a CPU
+register, PKRU is inherently thread-local, potentially giving each
+thread a different set of protections from every other thread.
+
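+For example, the two bits for key 'k' sit at bit positions k*2
+(Access Disable) and k*2+1 (Write Disable).  A sketch (the macro
+names are illustrative, not definitions from this series):
+
+	#define PKRU_AD_BIT(k)	(1u << ((k) * 2))	/* Access Disable */
+	#define PKRU_WD_BIT(k)	(1u << ((k) * 2 + 1))	/* Write Disable */
+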
+There are two new instructions (RDPKRU/WRPKRU) for reading and writing
+to the new register.  The feature is only available in 64-bit mode,
+even though there is theoretically space in the PAE PTEs.  These
+permissions are enforced on data access only and have no effect on
+instruction fetches.
+
+The kernel attempts to make protection keys consistent with the
+behavior of a plain mprotect().  For instance if you do this:
+
+	mprotect(ptr, size, PROT_NONE);
+	something(ptr);
+
+you can expect the same effects with protection keys when doing this:
+
+	pkey = sys_pkey_alloc(0, PKEY_DISABLE_ACCESS | PKEY_DISABLE_WRITE);
+	sys_pkey_mprotect(ptr, size, PROT_READ|PROT_WRITE, pkey);
+	something(ptr);
+
+That should be true whether something() is a direct access to 'ptr'
+like:
+
+	*ptr = foo;
+
+or when the kernel does the access on the application's behalf like
+with a read():
+
+	read(fd, ptr, 1);
+
+The kernel will send a SIGSEGV in both cases, but si_code will be set
+to SEGV_PKUERR when violating protection keys versus SEGV_ACCERR when
+the plain mprotect() permissions are violated.
+
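+A SIGSEGV handler installed with SA_SIGINFO can distinguish the two
+cases.  A sketch (handle_pkey_fault() is a hypothetical application
+function; si_pkey is the siginfo field added earlier in this series):
+
+	void handler(int sig, siginfo_t *si, void *ctx)
+	{
+		if (si->si_code == SEGV_PKUERR)
+			/* si_pkey identifies the violated key */
+			handle_pkey_fault((int)si->si_pkey);
+	}
+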
+=========================== Config Option ===========================
+
+This config option adds approximately 1.5kb of text and 50 bytes of
+data to the executable.  A workload which does large O_DIRECT reads
+of holes in XFS files was run to exercise get_user_pages_fast().  No
+performance delta was observed with the config option
+enabled or disabled.
_

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH 00/34] x86: Memory Protection Keys (v5)
  2015-12-04  1:14 ` Dave Hansen
@ 2015-12-04 23:31   ` Andy Lutomirski
  -1 siblings, 0 replies; 145+ messages in thread
From: Andy Lutomirski @ 2015-12-04 23:31 UTC (permalink / raw)
  To: Dave Hansen
  Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, X86 ML,
	Linux API, linux-arch, Andrea Arcangeli, Andrew Morton, Jan Kara,
	Kirill A. Shutemov, Naoya Horiguchi

On Thu, Dec 3, 2015 at 5:14 PM, Dave Hansen <dave@sr71.net> wrote:
> Memory Protection Keys for User pages is a CPU feature which will
> first appear on Skylake Servers, but will also be supported on
> future non-server parts.  It provides a mechanism for enforcing
> page-based protections, but without requiring modification of the
> page tables when an application changes protection domains.  See
> the Documentation/ patch for more details.

What, if anything, happened to the signal handling parts?

Also, do you have a git tree for this somewhere?  I can't actually
enable it (my laptop, while very shiny, is not a Skylake server), but
I can poke around a bit.

--Andy

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH 00/34] x86: Memory Protection Keys (v5)
  2015-12-04 23:31   ` Andy Lutomirski
  (?)
@ 2015-12-04 23:38     ` Dave Hansen
  -1 siblings, 0 replies; 145+ messages in thread
From: Dave Hansen @ 2015-12-04 23:38 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, X86 ML,
	Linux API, linux-arch, Andrea Arcangeli, Andrew Morton, Jan Kara,
	Kirill A. Shutemov, Naoya Horiguchi

On 12/04/2015 03:31 PM, Andy Lutomirski wrote:
> On Thu, Dec 3, 2015 at 5:14 PM, Dave Hansen <dave@sr71.net> wrote:
>> Memory Protection Keys for User pages is a CPU feature which will
>> first appear on Skylake Servers, but will also be supported on
>> future non-server parts.  It provides a mechanism for enforcing
>> page-based protections, but without requiring modification of the
>> page tables when an application changes protection domains.  See
>> the Documentation/ patch for more details.
> 
> What, if anything, happened to the signal handling parts?

Patches 12 and 13 contain most of it:

	x86, pkeys: fill in pkey field in siginfo
	signals, pkeys: notify userspace about protection key faults	

I decided to just not try to preserve the pkey_get/set() semantics
across entering and returning from signals, fwiw.

> Also, do you have a git tree for this somewhere?  I can't actually
> enable it (my laptop, while very shiny, is not a Skylake server), but
> I can poke around a bit.

http://git.kernel.org/cgit/linux/kernel/git/daveh/x86-pkeys.git/

Thanks for taking a look!

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH 26/34] mm: implement new mprotect_key() system call
  2015-12-04  1:15   ` Dave Hansen
  (?)
@ 2015-12-05  6:50     ` Michael Kerrisk (man-pages)
  -1 siblings, 0 replies; 145+ messages in thread
From: Michael Kerrisk (man-pages) @ 2015-12-05  6:50 UTC (permalink / raw)
  To: Dave Hansen, linux-kernel
  Cc: mtk.manpages, linux-mm, x86, dave.hansen, linux-api

Dave,

On 12/04/2015 02:15 AM, Dave Hansen wrote:
> From: Dave Hansen <dave.hansen@linux.intel.com>
> 
> mprotect_key() is just like mprotect, except it also takes a
> protection key as an argument.  On systems that do not support
> protection keys, it still works, but requires that key=0.
> Otherwise it does exactly what mprotect does.

Is there a man page for this API?

Thanks,

Michael


> I expect it to get used like this, if you want to guarantee that
> any mapping you create can *never* be accessed without the right
> protection keys set up.
> 
> 	pkey_deny_access(11); // random pkey
> 	int real_prot = PROT_READ|PROT_WRITE;
> 	ptr = mmap(NULL, PAGE_SIZE, PROT_NONE, MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
> 	ret = mprotect_key(ptr, PAGE_SIZE, real_prot, 11);
> 
> This way, there is *no* window where the mapping is accessible
> since it was always either PROT_NONE or had a protection key set.
> 
> We settled on 'unsigned long' for the type of the key here.  We
> only need 4 bits on x86 today, but I figured that other
> architectures might need some more space.
> 
> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
> Cc: linux-api@vger.kernel.org
> ---
> 
>  b/arch/x86/include/asm/mmu_context.h |   10 +++++++--
>  b/include/linux/pkeys.h              |    7 +++++-
>  b/mm/Kconfig                         |    7 ++++++
>  b/mm/mprotect.c                      |   36 +++++++++++++++++++++++++++++------
>  4 files changed, 51 insertions(+), 9 deletions(-)
> 
> diff -puN arch/x86/include/asm/mmu_context.h~pkeys-85-mprotect_pkey arch/x86/include/asm/mmu_context.h
> --- a/arch/x86/include/asm/mmu_context.h~pkeys-85-mprotect_pkey	2015-12-03 16:21:30.181877894 -0800
> +++ b/arch/x86/include/asm/mmu_context.h	2015-12-03 16:21:30.190878302 -0800
> @@ -4,6 +4,7 @@
>  #include <asm/desc.h>
>  #include <linux/atomic.h>
>  #include <linux/mm_types.h>
> +#include <linux/pkeys.h>
>  
>  #include <trace/events/tlb.h>
>  
> @@ -243,10 +244,14 @@ static inline void arch_unmap(struct mm_
>  		mpx_notify_unmap(mm, vma, start, end);
>  }
>  
> +#ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS
> +/*
> + * If the config option is off, we get the generic version from
> + * include/linux/pkeys.h.
> + */
>  static inline int vma_pkey(struct vm_area_struct *vma)
>  {
>  	u16 pkey = 0;
> -#ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS
>  	unsigned long vma_pkey_mask = VM_PKEY_BIT0 | VM_PKEY_BIT1 |
>  				      VM_PKEY_BIT2 | VM_PKEY_BIT3;
>  	/*
> @@ -259,9 +264,10 @@ static inline int vma_pkey(struct vm_are
>  	 */
>  	pkey = (vma->vm_flags >> vm_pkey_shift) &
>  	       (vma_pkey_mask >> vm_pkey_shift);
> -#endif
> +
>  	return pkey;
>  }
> +#endif
>  
>  static inline bool __pkru_allows_pkey(u16 pkey, bool write)
>  {
> diff -puN include/linux/pkeys.h~pkeys-85-mprotect_pkey include/linux/pkeys.h
> --- a/include/linux/pkeys.h~pkeys-85-mprotect_pkey	2015-12-03 16:21:30.183877985 -0800
> +++ b/include/linux/pkeys.h	2015-12-03 16:21:30.190878302 -0800
> @@ -2,10 +2,10 @@
>  #define _LINUX_PKEYS_H
>  
>  #include <linux/mm_types.h>
> -#include <asm/mmu_context.h>
>  
>  #ifdef CONFIG_ARCH_HAS_PKEYS
>  #include <asm/pkeys.h>
> +#include <asm/mmu_context.h>
>  #else /* ! CONFIG_ARCH_HAS_PKEYS */
>  
>  /*
> @@ -17,6 +17,11 @@ static inline bool arch_validate_pkey(in
>  {
>  	return true;
>  }
> +
> +static inline int vma_pkey(struct vm_area_struct *vma)
> +{
> +	return 0;
> +}
>  #endif /* ! CONFIG_ARCH_HAS_PKEYS */
>  
>  #endif /* _LINUX_PKEYS_H */
> diff -puN mm/Kconfig~pkeys-85-mprotect_pkey mm/Kconfig
> --- a/mm/Kconfig~pkeys-85-mprotect_pkey	2015-12-03 16:21:30.185878075 -0800
> +++ b/mm/Kconfig	2015-12-03 16:21:30.190878302 -0800
> @@ -673,3 +673,10 @@ config ARCH_USES_HIGH_VMA_FLAGS
>  	bool
>  config ARCH_HAS_PKEYS
>  	bool
> +
> +config NR_PROTECTION_KEYS
> +	int
> +	# Everything supports a _single_ key, so allow folks to
> +	# at least call APIs that take keys, but require that the
> +	# key be 0.
> +	default 1
> diff -puN mm/mprotect.c~pkeys-85-mprotect_pkey mm/mprotect.c
> --- a/mm/mprotect.c~pkeys-85-mprotect_pkey	2015-12-03 16:21:30.186878121 -0800
> +++ b/mm/mprotect.c	2015-12-03 16:21:30.191878347 -0800
> @@ -24,6 +24,7 @@
>  #include <linux/migrate.h>
>  #include <linux/perf_event.h>
>  #include <linux/ksm.h>
> +#include <linux/pkeys.h>
>  #include <asm/uaccess.h>
>  #include <asm/pgtable.h>
>  #include <asm/cacheflush.h>
> @@ -344,10 +345,13 @@ fail:
>  	return error;
>  }
>  
> -SYSCALL_DEFINE3(mprotect, unsigned long, start, size_t, len,
> -		unsigned long, prot)
> +/*
> + * pkey=-1 when doing a legacy mprotect()
> + */
> +static int do_mprotect_pkey(unsigned long start, size_t len,
> +		unsigned long prot, int pkey)
>  {
> -	unsigned long vm_flags, nstart, end, tmp, reqprot;
> +	unsigned long nstart, end, tmp, reqprot;
>  	struct vm_area_struct *vma, *prev;
>  	int error = -EINVAL;
>  	const int grows = prot & (PROT_GROWSDOWN|PROT_GROWSUP);
> @@ -373,8 +377,6 @@ SYSCALL_DEFINE3(mprotect, unsigned long,
>  	if ((prot & PROT_READ) && (current->personality & READ_IMPLIES_EXEC))
>  		prot |= PROT_EXEC;
>  
> -	vm_flags = calc_vm_prot_bits(prot, 0);
> -
>  	down_write(&current->mm->mmap_sem);
>  
>  	vma = find_vma(current->mm, start);
> @@ -407,7 +409,14 @@ SYSCALL_DEFINE3(mprotect, unsigned long,
>  
>  		/* Here we know that vma->vm_start <= nstart < vma->vm_end. */
>  
> -		newflags = vm_flags;
> +		/*
> +		 * If this is a vanilla, non-pkey mprotect, inherit the
> +		 * pkey from the VMA we are working on.
> +		 */
> +		if (pkey == -1)
> +			newflags = calc_vm_prot_bits(prot, vma_pkey(vma));
> +		else
> +			newflags = calc_vm_prot_bits(prot, pkey);
>  		newflags |= (vma->vm_flags & ~(VM_READ | VM_WRITE | VM_EXEC));
>  
>  		/* newflags >> 4 shift VM_MAY% in place of VM_% */
> @@ -443,3 +452,18 @@ out:
>  	up_write(&current->mm->mmap_sem);
>  	return error;
>  }
> +
> +SYSCALL_DEFINE3(mprotect, unsigned long, start, size_t, len,
> +		unsigned long, prot)
> +{
> +	return do_mprotect_pkey(start, len, prot, -1);
> +}
> +
> +SYSCALL_DEFINE4(pkey_mprotect, unsigned long, start, size_t, len,
> +		unsigned long, prot, int, pkey)
> +{
> +	if (!arch_validate_pkey(pkey))
> +		return -EINVAL;
> +
> +	return do_mprotect_pkey(start, len, prot, pkey);
> +}
> _


-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH 26/34] mm: implement new mprotect_key() system call
@ 2015-12-07 16:44       ` Dave Hansen
  0 siblings, 0 replies; 145+ messages in thread
From: Dave Hansen @ 2015-12-07 16:44 UTC (permalink / raw)
  To: Michael Kerrisk (man-pages), linux-kernel
  Cc: linux-mm, x86, dave.hansen, linux-api

[-- Attachment #1: Type: text/plain, Size: 613 bytes --]

On 12/04/2015 10:50 PM, Michael Kerrisk (man-pages) wrote:
> On 12/04/2015 02:15 AM, Dave Hansen wrote:
>> From: Dave Hansen <dave.hansen@linux.intel.com>
>>
>> mprotect_key() is just like mprotect, except it also takes a
>> protection key as an argument.  On systems that do not support
>> protection keys, it still works, but requires that key=0.
>> Otherwise it does exactly what mprotect does.
> 
> Is there a man page for this API?

Yep.  Patch to man-pages source is attached.  I actually broke it up
into a few separate pages.  I was planning on submitting these after the
patches themselves go upstream.

[-- Attachment #2: pkeys.patch --]
[-- Type: text/x-patch, Size: 6639 bytes --]

commit ebb12643876810931ed23992f92b7c77c2c36883
Author: Dave Hansen <dave.hansen@intel.com>
Date:   Mon Dec 7 08:42:57 2015 -0800

    pkeys

diff --git a/man2/mprotect.2 b/man2/mprotect.2
index ae305f6..a3c1e62 100644
--- a/man2/mprotect.2
+++ b/man2/mprotect.2
@@ -38,16 +38,19 @@
 .\"
 .TH MPROTECT 2 2015-07-23 "Linux" "Linux Programmer's Manual"
 .SH NAME
-mprotect \- set protection on a region of memory
+mprotect, mprotect_key \- set protection on a region of memory
 .SH SYNOPSIS
 .nf
 .B #include <sys/mman.h>
 .sp
 .BI "int mprotect(void *" addr ", size_t " len ", int " prot );
+.BI "int mprotect_key(void *" addr ", size_t " len ", int " prot , " int " key);
 .fi
 .SH DESCRIPTION
 .BR mprotect ()
-changes protection for the calling process's memory page(s)
+and
+.BR mprotect_key ()
+change protection for the calling process's memory page(s)
 containing any part of the address range in the
 interval [\fIaddr\fP,\ \fIaddr\fP+\fIlen\fP\-1].
 .I addr
@@ -74,10 +77,17 @@ The memory can be modified.
 .TP
 .B PROT_EXEC
 The memory can be executed.
+.PP
+.I key
+is the protection or storage key to assign to the memory.
+A key must be allocated with pkey_alloc () before it is
+passed to mprotect_key ().
 .SH RETURN VALUE
 On success,
 .BR mprotect ()
-returns zero.
+and
+.BR mprotect_key ()
+return zero.
 On error, \-1 is returned, and
 .I errno
 is set appropriately.
diff --git a/man2/pkey_alloc.2 b/man2/pkey_alloc.2
new file mode 100644
index 0000000..980ce3e
--- /dev/null
+++ b/man2/pkey_alloc.2
@@ -0,0 +1,72 @@
+.\" Copyright (C) 2007 Michael Kerrisk <mtk.manpages@gmail.com>
+.\" and Copyright (C) 1995 Michael Shields <shields@tembel.org>.
+.\"
+.\" %%%LICENSE_START(VERBATIM)
+.\" Permission is granted to make and distribute verbatim copies of this
+.\" manual provided the copyright notice and this permission notice are
+.\" preserved on all copies.
+.\"
+.\" Permission is granted to copy and distribute modified versions of this
+.\" manual under the conditions for verbatim copying, provided that the
+.\" entire resulting derived work is distributed under the terms of a
+.\" permission notice identical to this one.
+.\"
+.\" Since the Linux kernel and libraries are constantly changing, this
+.\" manual page may be incorrect or out-of-date.  The author(s) assume no
+.\" responsibility for errors or omissions, or for damages resulting from
+.\" the use of the information contained herein.  The author(s) may not
+.\" have taken the same level of care in the production of this manual,
+.\" which is licensed free of charge, as they might when working
+.\" professionally.
+.\"
+.\" Formatted or processed versions of this manual, if unaccompanied by
+.\" the source, must acknowledge the copyright and author of this work.
+.\" %%%LICENSE_END
+.\"
+.\" Modified 2015-12-04 by Dave Hansen <dave@sr71.net>
+.\"
+.\"
+.TH PKEY_ALLOC 2 2015-12-04 "Linux" "Linux Programmer's Manual"
+.SH NAME
+pkey_alloc, pkey_free \- allocate or free a protection key
+.SH SYNOPSIS
+.nf
+.B #include <sys/mman.h>
+.sp
+.BI "int pkey_alloc(unsigned long" flags ", unsigned long " init_val);
+.BI "int pkey_free(int " pkey);
+.fi
+.SH DESCRIPTION
+.BR pkey_alloc ()
+and
+.BR pkey_free ()
+allow or disallow the calling process to use the given
+protection key for all protection-key-related operations.
+
+.PP
+.I flags
+may contain zero or more disable operations:
+.B PKEY_DISABLE_ACCESS
+and/or
+.B PKEY_DISABLE_WRITE
+.SH RETURN VALUE
+On success,
+.BR pkey_alloc ()
+returns the allocated key;
+.BR pkey_free ()
+returns zero.
+On error, \-1 is returned, and
+.I errno
+is set appropriately.
+.SH ERRORS
+.TP
+.B EINVAL
+An invalid protection key, flag, or init_val was specified.
+.TP
+.B ENOSPC
+All protection keys available for the current process have
+been allocated.
+.SH SEE ALSO
+.BR mprotect_key (2),
+.BR pkey_get (2),
+.BR pkey_set (2)
diff --git a/man2/pkey_get.2 b/man2/pkey_get.2
new file mode 100644
index 0000000..4cfdea9
--- /dev/null
+++ b/man2/pkey_get.2
@@ -0,0 +1,76 @@
+.\" Copyright (C) 2007 Michael Kerrisk <mtk.manpages@gmail.com>
+.\" and Copyright (C) 1995 Michael Shields <shields@tembel.org>.
+.\"
+.\" %%%LICENSE_START(VERBATIM)
+.\" Permission is granted to make and distribute verbatim copies of this
+.\" manual provided the copyright notice and this permission notice are
+.\" preserved on all copies.
+.\"
+.\" Permission is granted to copy and distribute modified versions of this
+.\" manual under the conditions for verbatim copying, provided that the
+.\" entire resulting derived work is distributed under the terms of a
+.\" permission notice identical to this one.
+.\"
+.\" Since the Linux kernel and libraries are constantly changing, this
+.\" manual page may be incorrect or out-of-date.  The author(s) assume no
+.\" responsibility for errors or omissions, or for damages resulting from
+.\" the use of the information contained herein.  The author(s) may not
+.\" have taken the same level of care in the production of this manual,
+.\" which is licensed free of charge, as they might when working
+.\" professionally.
+.\"
+.\" Formatted or processed versions of this manual, if unaccompanied by
+.\" the source, must acknowledge the copyright and author of this work.
+.\" %%%LICENSE_END
+.\"
+.\" Modified 2015-12-04 by Dave Hansen <dave@sr71.net>
+.\"
+.\"
+.TH PKEY_GET 2 2015-12-04 "Linux" "Linux Programmer's Manual"
+.SH NAME
+pkey_get, pkey_set \- manage protection key access permissions
+.SH SYNOPSIS
+.nf
+.B #include <sys/mman.h>
+.sp
+.BI "int pkey_get(int " pkey);
+.BI "int pkey_set(int " pkey ", unsigned long " access_rights);
+.fi
+.SH DESCRIPTION
+.BR pkey_get ()
+and
+.BR pkey_set ()
+query or set the current set of rights for the calling
+task for the given protection key.
+When rights for a key are disabled, any future access
+to any memory region with that key set will generate
+a SIGSEGV.  The rights are local to the calling thread and
+do not affect any other threads.
+.PP
+Upon entering any signal handler, the process is given a
+default set of protection key rights which are separate from
+the main thread's.  Any calls to pkey_set () in a signal handler
+will not persist upon return from the handler.
+.PP
+.I access_rights
+may contain zero or more disable operations:
+.B PKEY_DISABLE_ACCESS
+and/or
+.B PKEY_DISABLE_WRITE
+.SH RETURN VALUE
+On success,
+.BR pkey_get ()
+returns the key's current access rights;
+.BR pkey_set ()
+returns zero.
+On error, \-1 is returned, and
+.I errno
+is set appropriately.
+.SH ERRORS
+.TP
+.B EINVAL
+An invalid protection key or access_rights was specified.
+.SH SEE ALSO
+.BR mprotect_key (2),
+.BR pkey_alloc (2),
+.BR pkey_free (2)
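
A worked example tying the three pages together (a sketch against the
draft API above; these wrappers do not exist in any libc yet, and
error handling is elided):

	#include <sys/mman.h>

	int main(void)
	{
		int pkey = pkey_alloc(0, 0);	/* no disable bits at allocation */
		char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
			       MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);

		mprotect_key(p, 4096, PROT_READ | PROT_WRITE, pkey);

		pkey_set(pkey, PKEY_DISABLE_WRITE);	/* this thread: no writes */
		/* *p = 1; would SIGSEGV here */
		pkey_set(pkey, 0);			/* restore access */
		*p = 1;

		pkey_free(pkey);
		return 0;
	}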

^ permalink raw reply related	[flat|nested] 145+ messages in thread

* Re: [PATCH 09/34] x86, pkeys: store protection in high VMA flags
  2015-12-04  1:14   ` Dave Hansen
@ 2015-12-08 14:17     ` Thomas Gleixner
  -1 siblings, 0 replies; 145+ messages in thread
From: Thomas Gleixner @ 2015-12-08 14:17 UTC (permalink / raw)
  To: Dave Hansen; +Cc: linux-kernel, linux-mm, x86, dave.hansen

On Thu, 3 Dec 2015, Dave Hansen wrote:
> vma->vm_flags is an 'unsigned long', so has space for 32 flags
> on 32-bit architectures.  The high 32 bits are unused on 64-bit
> platforms.  We've steered away from using the unused high VMA
> bits for things because we would have difficulty supporting it
> on 32-bit.
> 
> Protection Keys are not available in 32-bit mode, so there is
> no concern about supporting this feature in 32-bit mode or on
> 32-bit CPUs.
> 
> This patch carves out 4 bits from the high half of
> vma->vm_flags and allows architectures to set config option
> to make them available.
> 
> Sparse complains about these constants unless we explicitly
> call them "UL".
> 
> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>

Reviewed-by: Thomas Gleixner <tglx@linutronix.de>

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH 10/34] x86, pkeys: arch-specific protection bits
  2015-12-04  1:14   ` Dave Hansen
@ 2015-12-08 15:15     ` Thomas Gleixner
  -1 siblings, 0 replies; 145+ messages in thread
From: Thomas Gleixner @ 2015-12-08 15:15 UTC (permalink / raw)
  To: Dave Hansen; +Cc: linux-kernel, linux-mm, x86, dave.hansen

Dave,

On Thu, 3 Dec 2015, Dave Hansen wrote:
>  
> +static inline int vma_pkey(struct vm_area_struct *vma)

Shouldn't this return something unsigned?

> +{
> +	u16 pkey = 0;
> +#ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS
> +	unsigned long vma_pkey_mask = VM_PKEY_BIT0 | VM_PKEY_BIT1 |
> +				      VM_PKEY_BIT2 | VM_PKEY_BIT3;
> +	/*
> +	 * ffs is one-based, not zero-based, so bias back down by 1.
> +	 */
> +	int vm_pkey_shift = __builtin_ffsl(vma_pkey_mask) - 1;

Took me some time to figure out that this will resolve to a compile
time constant (hopefully). Is there a reason why we don't have a
VM_PKEY_SHIFT constant in the header file which makes that code just
simple and intuitive?

> +	/*
> +	 * gcc generates better code if we do this rather than:
> +	 * pkey = (flags & mask) >> shift
> +	 */
> +	pkey = (vma->vm_flags >> vm_pkey_shift) &
> +	       (vma_pkey_mask >> vm_pkey_shift);

My gcc (4.9) does it the other way round for whatever reason.

I really prefer to have this as simple as:

#ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS
#define VM_PKEY_MASK (VM_PKEY_BIT0 | VM_PKEY_BIT1 | VM_PKEY_BIT2 | VM_PKEY_BIT3)
#define VM_PKEY_SHIFT 
#else
#define VM_PKEY_MASK 0UL
#define VM_PKEY_SHIFT 0
#endif
    	 
static inline unsigned int vma_pkey(struct vm_area_struct *vma)
{
	 return (vma->vm_flags & VM_PKEY_MASK) >> VM_PKEY_SHIFT;
}

or 

#ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS
#define VM_PKEY_MASK (VM_PKEY_BIT0 | VM_PKEY_BIT1 | VM_PKEY_BIT2 | VM_PKEY_BIT3)
#define VM_PKEY_SHIFT 
static inline unsigned int vma_pkey(struct vm_area_struct *vma)
{
	 return (vma->vm_flags & VM_PKEY_MASK) >> VM_PKEY_SHIFT;
}
#else
static inline unsigned int vma_pkey(struct vm_area_struct *vma)
{
	 return 0;
}
#endif

Hmm?

	tglx

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH 14/34] x86, pkeys: add functions to fetch PKRU
  2015-12-04  1:14   ` Dave Hansen
@ 2015-12-08 15:18     ` Thomas Gleixner
  -1 siblings, 0 replies; 145+ messages in thread
From: Thomas Gleixner @ 2015-12-08 15:18 UTC (permalink / raw)
  To: Dave Hansen; +Cc: linux-kernel, linux-mm, x86, dave.hansen

On Thu, 3 Dec 2015, Dave Hansen wrote:
> +#ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS
> +static inline u32 __read_pkru(void)
> +{
> +	unsigned int ecx = 0;
> +	unsigned int edx, pkru;

  	u32 please.

Other than that: Reviewed-by: Thomas Gleixner <tglx@linutronix.de>

> +
> +	/*
> +	 * "rdpkru" instruction.  Places PKRU contents in to EAX,

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH 10/34] x86, pkeys: arch-specific protection bits
  2015-12-08 15:15     ` Thomas Gleixner
@ 2015-12-08 16:34       ` Dave Hansen
  -1 siblings, 0 replies; 145+ messages in thread
From: Dave Hansen @ 2015-12-08 16:34 UTC (permalink / raw)
  To: Thomas Gleixner; +Cc: linux-kernel, linux-mm, x86, dave.hansen

On 12/08/2015 07:15 AM, Thomas Gleixner wrote:
> On Thu, 3 Dec 2015, Dave Hansen wrote:
>>  
>> +static inline int vma_pkey(struct vm_area_struct *vma)
> 
> Shouldn't this return something unsigned?

Ingo had asked that we use 'int' in the syscalls at some point.  We also
use a -1 to mean "no pkey set" (to differentiate it from pkey=0) at
least at the very top of the syscall level.

>> +{
>> +	u16 pkey = 0;
>> +#ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS
>> +	unsigned long vma_pkey_mask = VM_PKEY_BIT0 | VM_PKEY_BIT1 |
>> +				      VM_PKEY_BIT2 | VM_PKEY_BIT3;
>> +	/*
>> +	 * ffs is one-based, not zero-based, so bias back down by 1.
>> +	 */
>> +	int vm_pkey_shift = __builtin_ffsl(vma_pkey_mask) - 1;
> 
> Took me some time to figure out that this will resolve to a compile
> time constant (hopefully). Is there a reason why we don't have a
> VM_PKEY_SHIFT constant in the header file which makes that code just
> simple and intuitive?

All of the VM_* flags are #defined as bitmaps directly and don't define
shifts:

#define VM_MAYWRITE     0x00000020
#define VM_MAYEXEC      0x00000040
#define VM_MAYSHARE     0x00000080
...

So to get a shift we've either got to do a ffs somewhere, or we have to
define the VM_PKEY_BIT*'s differently from all of the other VM_* flags.
 Or, we do something along the lines of:

#define VM_PKEY_BIT0 0x100000000UL
#define __VM_PKEY_SHIFT (32)

and we run a small risk that somebody will desynchronize the shift and
the bit definition.

We only need this shift in this *one* place, so that's why I opted for
the local variable and ffs.
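
The constant folding is easy to check in isolation, by the way.  With
optimization on, gcc compiles the sketch below to a plain mask and
shift, with no ffsl call (the mask value is illustrative of the four
high VMA bits):

	unsigned long flags_to_pkey(unsigned long vm_flags)
	{
		unsigned long mask = 0xfUL << 32;	/* VM_PKEY_BIT0..3 */
		int shift = __builtin_ffsl(mask) - 1;	/* folds to 32 */

		return (vm_flags & mask) >> shift;
	}

And if we did go the explicit-shift route, a
BUILD_BUG_ON(VM_PKEY_BIT0 != (1UL << __VM_PKEY_SHIFT)) would catch
the desynchronization I mentioned above.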

>> +	/*
>> +	 * gcc generates better code if we do this rather than:
>> +	 * pkey = (flags & mask) >> shift
>> +	 */
>> +	pkey = (vma->vm_flags >> vm_pkey_shift) &
>> +	       (vma_pkey_mask >> vm_pkey_shift);
> 
> My gcc (4.9) does it the other way round for whatever reason.

I'll go recheck.


^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH 10/34] x86, pkeys: arch-specific protection bits
  2015-12-08 16:34       ` Dave Hansen
@ 2015-12-08 17:24         ` Thomas Gleixner
  -1 siblings, 0 replies; 145+ messages in thread
From: Thomas Gleixner @ 2015-12-08 17:24 UTC (permalink / raw)
  To: Dave Hansen; +Cc: linux-kernel, linux-mm, x86, dave.hansen

Dave,

On Tue, 8 Dec 2015, Dave Hansen wrote:
> On 12/08/2015 07:15 AM, Thomas Gleixner wrote:
> > On Thu, 3 Dec 2015, Dave Hansen wrote:
> >>  
> >> +static inline int vma_pkey(struct vm_area_struct *vma)
> > 
> > Shouldn't this return something unsigned?
> 
> Ingo had asked that we use 'int' in the syscalls at some point.  We also
> use a -1 to mean "no pkey set" (to differentiate it from pkey=0) at
> least at the very top of the syscall level.

Ok.
 
> >> +{
> >> +	u16 pkey = 0;
> >> +#ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS
> >> +	unsigned long vma_pkey_mask = VM_PKEY_BIT0 | VM_PKEY_BIT1 |
> >> +				      VM_PKEY_BIT2 | VM_PKEY_BIT3;
> >> +	/*
> >> +	 * ffs is one-based, not zero-based, so bias back down by 1.
> >> +	 */
> >> +	int vm_pkey_shift = __builtin_ffsl(vma_pkey_mask) - 1;
> > 
> > Took me some time to figure out that this will resolve to a compile
> > time constant (hopefully). Is there a reason why we don't have a
> > VM_PKEY_SHIFT constant in the header file which makes that code just
> > simple and intuitive?
> 
> All of the VM_* flags are #defined as bitmaps directly and don't define
> shifts:
> 
> #define VM_MAYWRITE     0x00000020
> #define VM_MAYEXEC      0x00000040
> #define VM_MAYSHARE     0x00000080
> ...
> 
> So to get a shift we've either got to do a ffs somewhere, or we have to
> define the VM_PKEY_BIT*'s differently from all of the other VM_* flags.
>  Or, we do something along the lines of:
> 
> #define VM_PKEY_BIT0 0x100000000UL
> #define __VM_PKEY_SHIFT (32)

Well, yes. But these are the new "high" bits so we really can do it:

#define VM_KEY_BIT_SHIFT    32
#define VM_KEY_BIT0	    BIT(VM_KEY_BIT_SHIFT)
...
 
> and we run a small risk that somebody will desynchronize the shift and
> the bit definition.
> 
> We only need this shift in this *one* place, so that's why I opted for
> the local variable and ffs.
> 
> >> +	/*
> >> +	 * gcc generates better code if we do this rather than:
> >> +	 * pkey = (flags & mask) >> shift
> >> +	 */
> >> +	pkey = (vma->vm_flags >> vm_pkey_shift) &
> >> +	       (vma_pkey_mask >> vm_pkey_shift);
> > 
> > My gcc (4.9) does it the other way round for whatever reason.
> 
> I'll go recheck.

It's one instruction difference and that even depends on the offset of
vm_flags in the struct. So we really can go for the readable version :)

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH 15/34] mm: factor out VMA fault permission checking
  2015-12-04  1:14   ` Dave Hansen
@ 2015-12-08 17:26     ` Thomas Gleixner
  -1 siblings, 0 replies; 145+ messages in thread
From: Thomas Gleixner @ 2015-12-08 17:26 UTC (permalink / raw)
  To: Dave Hansen; +Cc: linux-kernel, linux-mm, x86, dave.hansen

On Thu, 3 Dec 2015, Dave Hansen wrote:
> This code matches a fault condition up with the VMA and ensures
> that the VMA allows the fault to be handled instead of just
> erroring out.
> 
> We will be extending this in a moment to comprehend protection
> keys.
> 
> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>

Reviewed-by: Thomas Gleixner <tglx@linutronix.de>

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH 16/34] x86, mm: simplify get_user_pages() PTE bit handling
  2015-12-04  1:14   ` Dave Hansen
@ 2015-12-08 18:01     ` Thomas Gleixner
  -1 siblings, 0 replies; 145+ messages in thread
From: Thomas Gleixner @ 2015-12-08 18:01 UTC (permalink / raw)
  To: Dave Hansen; +Cc: linux-kernel, linux-mm, x86, dave.hansen

On Thu, 3 Dec 2015, Dave Hansen wrote:
> 
> From: Dave Hansen <dave.hansen@linux.intel.com>
> 
> The current get_user_pages() code is a wee bit more complicated
> than it needs to be for pte bit checking.  Currently, it establishes
> a mask of required pte _PAGE_* bits and ensures that the pte it
> goes after has all those bits.
> 
> This consolidates the three identical copies of this code.
> 
> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
> ---
> 
>  b/arch/x86/mm/gup.c |   45 ++++++++++++++++++++++++++++-----------------
>  1 file changed, 28 insertions(+), 17 deletions(-)
> 
> diff -puN arch/x86/mm/gup.c~pkeys-16-gup-swizzle arch/x86/mm/gup.c
> --- a/arch/x86/mm/gup.c~pkeys-16-gup-swizzle	2015-12-03 16:21:25.148649631 -0800
> +++ b/arch/x86/mm/gup.c	2015-12-03 16:21:25.151649767 -0800
> @@ -63,6 +63,30 @@ retry:
>  #endif
>  }
>  
> +static inline int pte_allows_gup(pte_t pte, int write)
> +{
> +	/*
> +	 * 'pte' can really be a pte, pmd or pud.  We only check
> +	 * _PAGE_PRESENT, _PAGE_USER, and _PAGE_RW in here which
> +	 * are the same value on all 3 types.
> +	 */
> +	if (!(pte_flags(pte) & (_PAGE_PRESENT|_PAGE_USER)))
> +		return 0;
> +	if (write && !(pte_write(pte)))
> +		return 0;
> +	return 1;
> +}
> +
> +static inline int pmd_allows_gup(pmd_t pmd, int write)
> +{
> +	return pte_allows_gup(*(pte_t *)&pmd, write);
> +}
> +
> +static inline int pud_allows_gup(pud_t pud, int write)
> +{
> +	return pte_allows_gup(*(pte_t *)&pud, write);
> +}

This still puzzles me. And the only reason it compiles is because we
have -fno-strict-aliasing set ...

All this operates on the pteval or even just on the pte_flags(). Even
the new arch_pte_access_permitted() thingy which you add later is only
interrested in pte_flags() and its just a wrapper around
__pkru_allows_pkey().

So for readability and simplicity sake, can we please do something
like this (pkey check already added):

/*
 * 'pteval' can really be a pte, pmd or pud.  We only check
 * _PAGE_PRESENT, _PAGE_USER, and _PAGE_RW in here which
 * are the same value on all 3 types.
 */
static inline int pte_allows_gup(unsigned long pteval, int write)
{
	unsigned long mask = _PAGE_PRESENT|_PAGE_USER;

	if (write)
		mask |= _PAGE_RW;

	if ((pteval & mask) != mask)
		return 0;

	if (!__pkru_allows_pkey(pte_flags_pkey(pteval), write))
	   	return 0;
	return 1;
}

and at the callsites do:

    if (pte_allows_gup(pte_val(pte), write))

    if (pte_allows_gup(pmd_val(pmd), write))

    if (pte_allows_gup(pud_val(pud), write))

Hmm?


	tglx


^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH 10/34] x86, pkeys: arch-specific protection bits
  2015-12-08 17:24         ` Thomas Gleixner
  (?)
@ 2015-12-08 18:06         ` Dave Hansen
  2015-12-08 18:29             ` Thomas Gleixner
  -1 siblings, 1 reply; 145+ messages in thread
From: Dave Hansen @ 2015-12-08 18:06 UTC (permalink / raw)
  To: Thomas Gleixner; +Cc: linux-kernel, linux-mm, x86, dave.hansen

[-- Attachment #1: Type: text/plain, Size: 170 bytes --]

Here's how it looks with the suggested modifications.

Whatever compiler wonkiness I was seeing is gone now, so I've used the
most straightforward version of the shifts.

[-- Attachment #2: pkeys-08-store-pkey-in-vma.patch --]
[-- Type: text/x-patch, Size: 5349 bytes --]


From: Dave Hansen <dave.hansen@linux.intel.com>

Lots of things seem to do:

        vma->vm_page_prot = vm_get_page_prot(flags);

and the ptes get created right from things we pull out
of ->vm_page_prot.  So it is very convenient if we can
store the protection key in flags and vm_page_prot, just
like the existing permission bits (_PAGE_RW/PRESENT).  It
greatly reduces the amount of plumbing and arch-specific
hacking we have to do in generic code.

This also takes the new PROT_PKEY{0,1,2,3} flags and
turns *those* in to VM_ flags for vma->vm_flags.

The protection key values are stored in 4 places:
	1. "prot" argument to system calls
	2. vma->vm_flags, filled from the mmap "prot"
	3. vma->vm_page_prot, filled from vma->vm_flags
	4. the PTE itself.

The pseudocode for these four steps is as follows:

	mmap(PROT_PKEY*)
	vma->vm_flags 	  = ... | arch_calc_vm_prot_bits(mmap_prot);
	vma->vm_page_prot = ... | arch_vm_get_page_prot(vma->vm_flags);
	pte = pfn | vma->vm_page_prot
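
A worked example (pkey=5 == 0b0101, value chosen arbitrarily;
the pte bit positions assume _PAGE_PKEY_BIT0..3 occupy the
high pte bits):

	prot:     PROT_READ | PROT_PKEY0 | PROT_PKEY2
	vm_flags: VM_PKEY_BIT0 | VM_PKEY_BIT2	(bits 32 and 34)
	pte:      _PAGE_PKEY_BIT0 | _PAGE_PKEY_BIT2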

Note that this provides a new definition for x86:

	arch_vm_get_page_prot()

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
---

 b/arch/x86/include/asm/mmu_context.h   |   15 +++++++++++++++
 b/arch/x86/include/asm/pgtable_types.h |   12 ++++++++++--
 b/arch/x86/include/uapi/asm/mman.h     |   16 ++++++++++++++++
 b/include/linux/mm.h                   |    7 +++++++
 4 files changed, 48 insertions(+), 2 deletions(-)

diff -puN arch/x86/include/asm/mmu_context.h~pkeys-08-store-pkey-in-vma arch/x86/include/asm/mmu_context.h
--- a/arch/x86/include/asm/mmu_context.h~pkeys-08-store-pkey-in-vma	2015-12-08 09:49:27.429201204 -0800
+++ b/arch/x86/include/asm/mmu_context.h	2015-12-08 10:03:43.077939260 -0800
@@ -243,4 +243,19 @@ static inline void arch_unmap(struct mm_
 		mpx_notify_unmap(mm, vma, start, end);
 }
 
+static inline int vma_pkey(struct vm_area_struct *vma)
+{
+	u16 pkey = 0;
+#ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS
+	unsigned long vma_pkey_mask = VM_PKEY_BIT0 | VM_PKEY_BIT1 |
+				      VM_PKEY_BIT2 | VM_PKEY_BIT3;
+	/*
+	 * The pkey bits are contiguous in vm_flags, so a plain
+	 * mask and shift extracts the key:
+	 */
+	pkey = (vma->vm_flags & vma_pkey_mask) >> VM_PKEY_SHIFT;
+#endif
+	return pkey;
+}
+
 #endif /* _ASM_X86_MMU_CONTEXT_H */
diff -puN arch/x86/include/asm/pgtable_types.h~pkeys-08-store-pkey-in-vma arch/x86/include/asm/pgtable_types.h
--- a/arch/x86/include/asm/pgtable_types.h~pkeys-08-store-pkey-in-vma	2015-12-08 09:49:27.431201294 -0800
+++ b/arch/x86/include/asm/pgtable_types.h	2015-12-08 09:49:27.438201611 -0800
@@ -111,7 +111,12 @@
 #define _KERNPG_TABLE	(_PAGE_PRESENT | _PAGE_RW | _PAGE_ACCESSED |	\
 			 _PAGE_DIRTY)
 
-/* Set of bits not changed in pte_modify */
+/*
+ * Set of bits not changed in pte_modify.  The pte's
+ * protection key is treated like _PAGE_RW, for
+ * instance, and is *not* included in this mask since
+ * pte_modify() does modify it.
+ */
 #define _PAGE_CHG_MASK	(PTE_PFN_MASK | _PAGE_PCD | _PAGE_PWT |		\
 			 _PAGE_SPECIAL | _PAGE_ACCESSED | _PAGE_DIRTY |	\
 			 _PAGE_SOFT_DIRTY)
@@ -227,7 +232,10 @@ enum page_cache_mode {
 /* Extracts the PFN from a (pte|pmd|pud|pgd)val_t of a 4KB page */
 #define PTE_PFN_MASK		((pteval_t)PHYSICAL_PAGE_MASK)
 
-/* Extracts the flags from a (pte|pmd|pud|pgd)val_t of a 4KB page */
+/*
+ *  Extracts the flags from a (pte|pmd|pud|pgd)val_t
+ *  This includes the protection key value.
+ */
 #define PTE_FLAGS_MASK		(~PTE_PFN_MASK)
 
 typedef struct pgprot { pgprotval_t pgprot; } pgprot_t;
diff -puN arch/x86/include/uapi/asm/mman.h~pkeys-08-store-pkey-in-vma arch/x86/include/uapi/asm/mman.h
--- a/arch/x86/include/uapi/asm/mman.h~pkeys-08-store-pkey-in-vma	2015-12-08 09:49:27.432201339 -0800
+++ b/arch/x86/include/uapi/asm/mman.h	2015-12-08 09:49:27.438201611 -0800
@@ -6,6 +6,22 @@
 #define MAP_HUGE_2MB    (21 << MAP_HUGE_SHIFT)
 #define MAP_HUGE_1GB    (30 << MAP_HUGE_SHIFT)
 
+#ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS
+/*
+ * Take the 4 protection key bits out of the vma->vm_flags
+ * value and turn them in to the bits that we can put in
+ * to a pte.
+ *
+ * Only override these if Protection Keys are available
+ * (which is only on 64-bit).
+ */
+#define arch_vm_get_page_prot(vm_flags)	__pgprot(	\
+		((vm_flags) & VM_PKEY_BIT0 ? _PAGE_PKEY_BIT0 : 0) |	\
+		((vm_flags) & VM_PKEY_BIT1 ? _PAGE_PKEY_BIT1 : 0) |	\
+		((vm_flags) & VM_PKEY_BIT2 ? _PAGE_PKEY_BIT2 : 0) |	\
+		((vm_flags) & VM_PKEY_BIT3 ? _PAGE_PKEY_BIT3 : 0))
+#endif
+
 #include <asm-generic/mman.h>
 
 #endif /* _ASM_X86_MMAN_H */
diff -puN include/linux/mm.h~pkeys-08-store-pkey-in-vma include/linux/mm.h
--- a/include/linux/mm.h~pkeys-08-store-pkey-in-vma	2015-12-08 09:49:27.434201430 -0800
+++ b/include/linux/mm.h	2015-12-08 09:49:27.439201656 -0800
@@ -171,6 +171,13 @@ extern unsigned int kobjsize(const void
 
 #if defined(CONFIG_X86)
 # define VM_PAT		VM_ARCH_1	/* PAT reserves whole VMA at once (x86) */
+#if defined (CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS)
+# define VM_PKEY_SHIFT	VM_HIGH_ARCH_BIT_0
+# define VM_PKEY_BIT0	VM_HIGH_ARCH_0	/* A protection key is a 4-bit value */
+# define VM_PKEY_BIT1	VM_HIGH_ARCH_1
+# define VM_PKEY_BIT2	VM_HIGH_ARCH_2
+# define VM_PKEY_BIT3	VM_HIGH_ARCH_3
+#endif
 #elif defined(CONFIG_PPC)
 # define VM_SAO		VM_ARCH_1	/* Strong Access Ordering (powerpc) */
 #elif defined(CONFIG_PARISC)
_

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH 17/34] x86, pkeys: check VMAs and PTEs for protection keys
  2015-12-04  1:14   ` Dave Hansen
@ 2015-12-08 18:11     ` Thomas Gleixner
  -1 siblings, 0 replies; 145+ messages in thread
From: Thomas Gleixner @ 2015-12-08 18:11 UTC (permalink / raw)
  To: Dave Hansen; +Cc: linux-kernel, linux-mm, x86, dave.hansen

On Thu, 3 Dec 2015, Dave Hansen wrote:
> Today, for normal faults and page table walks, we check the VMA
> and/or PTE to ensure that it is compatible with the action.  For
> instance, if we get a write fault on a non-writeable VMA, we
> SIGSEGV.
> 
> We try to do the same thing for protection keys.  Basically, we
> try to make sure that if a user does this:
> 
> 	mprotect(ptr, size, PROT_NONE);
> 	*ptr = foo;
> 
> they see the same effects with protection keys when they do this:
> 
> 	mprotect(ptr, size, PROT_READ|PROT_WRITE);
> 	set_pkey(ptr, size, 4);
> 	wrpkru(0xffffff3f); // access disable pkey 4
> 	*ptr = foo;
> 
> The state to do that checking is in the VMA, but we also
> sometimes have to do it on the page tables only, like when doing
> a get_user_pages_fast() where we have no VMA.
> 
> We add two functions and expose them to generic code:
> 
> 	arch_pte_access_permitted(pte_flags, write)
> 	arch_vma_access_permitted(vma, write)
> 
> These are, of course, backed up in x86 arch code with checks
> against the PTE or VMA's protection key.
> 
> But, there are also cases where we do not want to respect
> protection keys.  When we ptrace(), for instance, we do not want
> to apply the tracer's PKRU permissions to the PTEs from the
> process being traced.
> 
> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>

Reviewed-by: Thomas Gleixner <tglx@linutronix.de>

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH 19/34] x86, pkeys: optimize fault handling in access_error()
  2015-12-04  1:14   ` Dave Hansen
@ 2015-12-08 18:14     ` Thomas Gleixner
  -1 siblings, 0 replies; 145+ messages in thread
From: Thomas Gleixner @ 2015-12-08 18:14 UTC (permalink / raw)
  To: Dave Hansen; +Cc: linux-kernel, linux-mm, x86, dave.hansen

On Thu, 3 Dec 2015, Dave Hansen wrote:
> diff -puN arch/x86/mm/fault.c~pkeys-15-access_error arch/x86/mm/fault.c
> --- a/arch/x86/mm/fault.c~pkeys-15-access_error	2015-12-03 16:21:26.872727820 -0800
> +++ b/arch/x86/mm/fault.c	2015-12-03 16:21:26.876728002 -0800
> @@ -900,10 +900,16 @@ bad_area(struct pt_regs *regs, unsigned
>  static inline bool bad_area_access_from_pkeys(unsigned long error_code,
>  		struct vm_area_struct *vma)
>  {
> +	/* This code is always called on the current mm */
> +	int foreign = 0;

arch_vma_access_permitted takes a bool ....

>  	if (!boot_cpu_has(X86_FEATURE_OSPKE))
>  		return false;
>  	if (error_code & PF_PK)
>  		return true;
> +	/* this checks permission keys on the VMA: */
> +	if (!arch_vma_access_permitted(vma, (error_code & PF_WRITE), foreign))
> +		return true;
>  	return false;
>  }
>  
> @@ -1091,6 +1097,8 @@ int show_unhandled_signals = 1;
>  static inline int
>  access_error(unsigned long error_code, struct vm_area_struct *vma)
>  {
> +	/* This is only called for the current mm, so: */
> +	int foreign = 0;

Ditto.
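
i.e., something like this sketch:

	/* This code is always called on the current mm */
	bool foreign = false;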

Other than that: Reviewed-by: Thomas Gleixner <tglx@linutronix.de>

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH 20/34] x86, pkeys: differentiate instruction fetches
  2015-12-04  1:14   ` Dave Hansen
@ 2015-12-08 18:17     ` Thomas Gleixner
  -1 siblings, 0 replies; 145+ messages in thread
From: Thomas Gleixner @ 2015-12-08 18:17 UTC (permalink / raw)
  To: Dave Hansen; +Cc: linux-kernel, linux-mm, x86, dave.hansen

On Thu, 3 Dec 2015, Dave Hansen wrote:
>  static inline bool arch_vma_access_permitted(struct vm_area_struct *vma,
> -		bool write, bool foreign)
> +		bool write, bool execute, bool foreign)
....
> +	/*
> +	 * gups are always data accesses, not instruction
> +	 * fetches, so execute=0 here

Again. Can we please be consistent about booleans?

> +	 */
> +	if (!arch_vma_access_permitted(vma, write, 0, foreign))
>  		return -EFAULT;
>  	return 0;
>  }
> @@ -576,8 +580,11 @@ bool vma_permits_fault(struct vm_area_st
>  	/*
>  	 * The architecture might have a hardware protection
>  	 * mechanism other than read/write that can deny access.
> +	 *
> +	 * gup always represents data access, not instruction
> +	 * fetches, so execute=0 here:
>  	 */
> -	if (!arch_vma_access_permitted(vma, write, foreign))
> +	if (!arch_vma_access_permitted(vma, write, 0, foreign))
>  		return false;

Ditto.

Other than that: Reviewed-by: Thomas Gleixner <tglx@linutronix.de>

^ permalink raw reply	[flat|nested] 145+ messages in thread
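
[Editor's note: the same bool-consistency request applies to the new
'execute' argument.  A sketch of the gup call site with the constant
spelled as a bool rather than a bare 0; illustrative only.]

	/*
	 * gups are always data accesses, not instruction fetches,
	 * so execute is false here:
	 */
	if (!arch_vma_access_permitted(vma, write, false, foreign))
		return -EFAULT;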

* Re: [PATCH 21/34] x86, pkeys: dump PKRU with other kernel registers
  2015-12-04  1:14   ` Dave Hansen
@ 2015-12-08 18:19     ` Thomas Gleixner
  -1 siblings, 0 replies; 145+ messages in thread
From: Thomas Gleixner @ 2015-12-08 18:19 UTC (permalink / raw)
  To: Dave Hansen; +Cc: linux-kernel, linux-mm, x86, dave.hansen

On Thu, 3 Dec 2015, Dave Hansen wrote:

> 
> From: Dave Hansen <dave.hansen@linux.intel.com>
> 
> I'm a bit ambivalent about whether this is needed or not.
>
> Protection Keys never affect kernel mappings.  But, they can
> affect whether the kernel will fault when it touches a user
> mapping.  But, the kernel doesn't touch user mappings without
> some careful choreography and these accesses don't generally
> result in oopses.

Well, if we miss some careful choreography at some place, this
information is going to be helpful.

Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
 
> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
> ---
> 
>  b/arch/x86/kernel/process_64.c |    2 ++
>  1 file changed, 2 insertions(+)
> 
> diff -puN arch/x86/kernel/process_64.c~pkeys-30-kernel-error-dumps arch/x86/kernel/process_64.c
> --- a/arch/x86/kernel/process_64.c~pkeys-30-kernel-error-dumps	2015-12-03 16:21:27.874773264 -0800
> +++ b/arch/x86/kernel/process_64.c	2015-12-03 16:21:27.877773400 -0800
> @@ -116,6 +116,8 @@ void __show_regs(struct pt_regs *regs, i
>  	printk(KERN_DEFAULT "DR0: %016lx DR1: %016lx DR2: %016lx\n", d0, d1, d2);
>  	printk(KERN_DEFAULT "DR3: %016lx DR6: %016lx DR7: %016lx\n", d3, d6, d7);
>  
> +	if (boot_cpu_has(X86_FEATURE_OSPKE))
> +		printk(KERN_DEFAULT "PKRU: %08x\n", read_pkru());
>  }
>  
>  void release_thread(struct task_struct *dead_task)
> _
> 

^ permalink raw reply	[flat|nested] 145+ messages in thread
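
[Editor's note: for context, read_pkru() boils down to the RDPKRU
instruction, which returns the 32-bit PKRU register in EAX and
requires ECX to be zero.  A sketch along the lines of this series'
helper; the exact function is not quoted here.]

	static inline u32 read_pkru(void)
	{
		u32 pkru, edx;

		/* RDPKRU, encoded by hand for older assemblers */
		asm volatile(".byte 0x0f,0x01,0xee"
			     : "=a" (pkru), "=d" (edx)
			     : "c" (0));
		return pkru;
	}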

* Re: [PATCH 22/34] x86, pkeys: dump PTE pkey in /proc/pid/smaps
  2015-12-04  1:14   ` Dave Hansen
@ 2015-12-08 18:20     ` Thomas Gleixner
  -1 siblings, 0 replies; 145+ messages in thread
From: Thomas Gleixner @ 2015-12-08 18:20 UTC (permalink / raw)
  To: Dave Hansen; +Cc: linux-kernel, linux-mm, x86, dave.hansen

On Thu, 3 Dec 2015, Dave Hansen wrote:
> The protection key can now be just as important as read/write
> permissions on a VMA.  We need some debug mechanism to help
> figure out if it is in play.  smaps seems like a logical
> place to expose it.
> 
> arch/x86/kernel/setup.c is a bit of a weirdo place to put
> this code, but it already had seq_file.h and there was not
> a much better existing place to put it.
> 
> We also use no #ifdef.  If protection keys is .config'd out
> we will get the same function as if we used the weak generic
> function.
> 
> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>

Reviewed-by: Thomas Gleixner <tglx@linutronix.de>

^ permalink raw reply	[flat|nested] 145+ messages in thread
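
[Editor's note: the hunk itself is not quoted above.  Roughly, the x86
override of the weak hook looks like this (names as used by this
series; treat as a sketch):]

	void arch_show_smap(struct seq_file *m, struct vm_area_struct *vma)
	{
		if (!boot_cpu_has(X86_FEATURE_OSPKE))
			return;
		seq_printf(m, "ProtectionKey:  %8u\n", vma_pkey(vma));
	}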

* Re: [PATCH 23/34] x86, pkeys: add Kconfig prompt to existing config option
  2015-12-04  1:14   ` Dave Hansen
@ 2015-12-08 18:21     ` Thomas Gleixner
  -1 siblings, 0 replies; 145+ messages in thread
From: Thomas Gleixner @ 2015-12-08 18:21 UTC (permalink / raw)
  To: Dave Hansen; +Cc: linux-kernel, linux-mm, x86, dave.hansen

On Thu, 3 Dec 2015, Dave Hansen wrote:
> I don't have a strong opinion on whether we need this or not.
> Protection Keys has relatively little code associated with it,
> and it is not a heavyweight feature to keep enabled.  However,
> I can imagine that folks would still appreciate being able to
> disable it.

The tiny kernel folks are happy about every few kB which are NOT
added by default.
 
> 
> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>

Reviewed-by: Thomas Gleixner <tglx@linutronix.de>

> ---
> 
>  b/arch/x86/Kconfig |   10 ++++++++++
>  1 file changed, 10 insertions(+)
> 
> diff -puN arch/x86/Kconfig~pkeys-40-kconfig-prompt arch/x86/Kconfig
> --- a/arch/x86/Kconfig~pkeys-40-kconfig-prompt	2015-12-03 16:21:28.726811905 -0800
> +++ b/arch/x86/Kconfig	2015-12-03 16:21:28.730812086 -0800
> @@ -1682,8 +1682,18 @@ config X86_INTEL_MPX
>  	  If unsure, say N.
>  
>  config X86_INTEL_MEMORY_PROTECTION_KEYS
> +	prompt "Intel Memory Protection Keys"
>  	def_bool y
> +	# Note: only available in 64-bit mode
>  	depends on CPU_SUP_INTEL && X86_64
> +	---help---
> +	  Memory Protection Keys provides a mechanism for enforcing
> +	  page-based protections, but without requiring modification of the
> +	  page tables when an application changes protection domains.
> +
> +	  For details, see Documentation/x86/protection-keys.txt
> +
> +	  If unsure, say y.
>  
>  config EFI
>  	bool "EFI runtime service support"
> _
> 

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH 10/34] x86, pkeys: arch-specific protection bitsy
  2015-12-08 18:06         ` Dave Hansen
@ 2015-12-08 18:29             ` Thomas Gleixner
  0 siblings, 0 replies; 145+ messages in thread
From: Thomas Gleixner @ 2015-12-08 18:29 UTC (permalink / raw)
  To: Dave Hansen; +Cc: linux-kernel, linux-mm, x86, dave.hansen

On Tue, 8 Dec 2015, Dave Hansen wrote:

> Here's how it looks with the suggested modifications.
> 
> Whatever compiler wonkiness I was seeing is gone now, so I've used the
> most straightforward version of the shifts.

> +        * gcc generates better code if we do this rather than:
> +        * pkey = (flags & mask) >> shift
> +        */
> +       pkey = (vma->vm_flags & vma_pkey_mask) >> VM_PKEY_SHIFT;

ROTFL!


^ permalink raw reply	[flat|nested] 145+ messages in thread
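
[Editor's note: the joke is that the comment promises a gcc-friendly
alternative and the line below it is exactly the
'(flags & mask) >> shift' form the comment claims to avoid.  Minus the
stale comment, vma_pkey() is simply:]

	static inline u16 vma_pkey(struct vm_area_struct *vma)
	{
		unsigned long vma_pkey_mask = VM_PKEY_BIT0 | VM_PKEY_BIT1 |
					      VM_PKEY_BIT2 | VM_PKEY_BIT3;

		return (vma->vm_flags & vma_pkey_mask) >> VM_PKEY_SHIFT;
	}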

* Re: [PATCH 16/34] x86, mm: simplify get_user_pages() PTE bit handling
  2015-12-08 18:01     ` Thomas Gleixner
@ 2015-12-08 18:30       ` Dave Hansen
  -1 siblings, 0 replies; 145+ messages in thread
From: Dave Hansen @ 2015-12-08 18:30 UTC (permalink / raw)
  To: Thomas Gleixner; +Cc: linux-kernel, linux-mm, x86, dave.hansen

On 12/08/2015 10:01 AM, Thomas Gleixner wrote:
> static inline int pte_allows_gup(unsigned long pteval, int write)
> {
> 	unsigned long mask = _PAGE_PRESENT|_PAGE_USER;
> 
> 	if (write)
> 		mask |= _PAGE_RW;
> 
> 	if ((pteval & mask) != mask)
> 		return 0;
> 
> 	if (!__pkru_allows_pkey(pte_flags_pkey(pteval), write))
> 	   	return 0;
> 	return 1;
> }
> 
> and at the callsites do:
> 
>     if (pte_allows_gup(pte_val(pte, write))
> 
>     if (pte_allows_gup(pmd_val(pmd, write))
> 
>     if (pte_allows_gup(pud_val(pud, write))
> 
> Hmm?

Looks fine to me.  I'll do that.

^ permalink raw reply	[flat|nested] 145+ messages in thread
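
[Editor's note: the closing parentheses in the three quoted call sites
are misplaced -- 'pte_val(pte, write)' would not compile.  The
intended calls are presumably:]

	if (pte_allows_gup(pte_val(pte), write))

	if (pte_allows_gup(pmd_val(pmd), write))

	if (pte_allows_gup(pud_val(pud), write))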

* Re: [PATCH 10/34] x86, pkeys: arch-specific protection bitsy
  2015-12-08 18:29             ` Thomas Gleixner
@ 2015-12-08 18:35               ` Thomas Gleixner
  -1 siblings, 0 replies; 145+ messages in thread
From: Thomas Gleixner @ 2015-12-08 18:35 UTC (permalink / raw)
  To: Dave Hansen; +Cc: linux-kernel, linux-mm, x86, dave.hansen

On Tue, 8 Dec 2015, Thomas Gleixner wrote:
> On Tue, 8 Dec 2015, Dave Hansen wrote:
> 
> > Here's how it looks with the suggested modifications.
> > 
> > Whatever compiler wonkiness I was seeing is gone now, so I've used the
> > most straightforward version of the shifts.
> 
> > +        * gcc generates better code if we do this rather than:
> > +        * pkey = (flags & mask) >> shift
> > +        */
> > +       pkey = (vma->vm_flags & vma_pkey_mask) >> VM_PKEY_SHIFT;
> 
> ROTFL!

Other than that silly comment, it's way better than before.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH 25/34] x86, pkeys: add arch_validate_pkey()
  2015-12-04  1:14   ` Dave Hansen
@ 2015-12-08 18:39     ` Thomas Gleixner
  -1 siblings, 0 replies; 145+ messages in thread
From: Thomas Gleixner @ 2015-12-08 18:39 UTC (permalink / raw)
  To: Dave Hansen; +Cc: linux-kernel, linux-mm, x86, dave.hansen

On Thu, 3 Dec 2015, Dave Hansen wrote:
> +#define arch_max_pkey() (boot_cpu_has(X86_FEATURE_OSPKE) ?      \
> +				CONFIG_NR_PROTECTION_KEYS : 1)

Should this really be a config option? Can't that value change ?

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH 27/34] x86, pkeys: make mprotect_key() mask off additional vm_flags
  2015-12-04  1:15   ` Dave Hansen
@ 2015-12-08 18:41     ` Thomas Gleixner
  -1 siblings, 0 replies; 145+ messages in thread
From: Thomas Gleixner @ 2015-12-08 18:41 UTC (permalink / raw)
  To: Dave Hansen; +Cc: linux-kernel, linux-mm, x86, dave.hansen

On Thu, 3 Dec 2015, Dave Hansen wrote:
> 
> From: Dave Hansen <dave.hansen@linux.intel.com>
> 
> Today, mprotect() takes 4 bits of data: PROT_READ/WRITE/EXEC/NONE.
> Three of those bits: READ/WRITE/EXEC get translated directly in to
> vma->vm_flags by calc_vm_prot_bits().  If a bit is unset in
> mprotect()'s 'prot' argument then it must be cleared in vma->vm_flags
> during the mprotect() call.
> 
> We do this by first calculating the VMA flags we want set, then
> clearing the ones we do not want to inherit from the original VMA:
> 
> 	vm_flags = calc_vm_prot_bits(prot, key);
> 	...
> 	newflags = vm_flags;
> 	newflags |= (vma->vm_flags & ~(VM_READ | VM_WRITE | VM_EXEC));
> 
> However, we *also* want to mask off the original VMA's vm_flags in
> which we store the protection key.
> 
> To do that, this patch adds a new macro:
> 
> 	ARCH_VM_FLAGS_AFFECTED_BY_MPROTECT

-ENOSUCHMACRO
 
> which allows the architecture to specify additional bits that it would
> like cleared.  We use that to ensure that the VM_PKEY_BIT* bits get
> cleared.

Other than that: Reviewed-by: Thomas Gleixner <tglx@linutronix.de>

^ permalink raw reply	[flat|nested] 145+ messages in thread
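
[Editor's note: a sketch of the masking the description is driving at.
ARCH_VM_PKEY_FLAGS is an assumed name (it is what ultimately landed
upstream); as Thomas points out, the macro named in this commit
message does not exist in the patch.]

	newflags = vm_flags;
	/* clear the old PROT bits *and* the arch's pkey bits: */
	newflags |= (vma->vm_flags & ~(VM_READ | VM_WRITE | VM_EXEC |
				       ARCH_VM_PKEY_FLAGS));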

* Re: [PATCH 28/34] x86: wire up mprotect_key() system call
  2015-12-04  1:15   ` Dave Hansen
  (?)
@ 2015-12-08 18:44     ` Thomas Gleixner
  -1 siblings, 0 replies; 145+ messages in thread
From: Thomas Gleixner @ 2015-12-08 18:44 UTC (permalink / raw)
  To: Dave Hansen; +Cc: linux-kernel, linux-mm, x86, dave.hansen, linux-api

On Thu, 3 Dec 2015, Dave Hansen wrote:
>  #include <asm-generic/mman.h>
> diff -puN mm/Kconfig~pkeys-16-x86-mprotect_key mm/Kconfig
> --- a/mm/Kconfig~pkeys-16-x86-mprotect_key	2015-12-03 16:21:31.114920208 -0800
> +++ b/mm/Kconfig	2015-12-03 16:21:31.119920435 -0800
> @@ -679,4 +679,5 @@ config NR_PROTECTION_KEYS
>  	# Everything supports a _single_ key, so allow folks to
>  	# at least call APIs that take keys, but require that the
>  	# key be 0.
> +	default 16 if X86_INTEL_MEMORY_PROTECTION_KEYS
>  	default 1

What happens if I set that to 42?

I think we want to make this a runtime evaluated thingy. If pkeys are
compiled in, but the machine does not support it then we don't support
16 keys, or do we?

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH 29/34] x86: separate out LDT init from context init
  2015-12-04  1:15   ` Dave Hansen
@ 2015-12-08 18:45     ` Thomas Gleixner
  -1 siblings, 0 replies; 145+ messages in thread
From: Thomas Gleixner @ 2015-12-08 18:45 UTC (permalink / raw)
  To: Dave Hansen; +Cc: linux-kernel, linux-mm, x86, dave.hansen

On Thu, 3 Dec 2015, Dave Hansen wrote:
> The arch-specific mm_context_t is a great place to put
> protection-key allocation state.
> 
> But, we need to initialize the allocation state because pkey 0 is
> always "allocated".  All of the runtime initialization of
> mm_context_t is done in *_ldt() manipulation functions.  This
> renames the existing LDT functions like this:
> 
> 	init_new_context() -> init_new_context_ldt()
> 	destroy_context() -> destroy_context_ldt()
> 
> and makes init_new_context() and destroy_context() available for
> generic use.

Reviewed-by: Thomas Gleixner <tglx@linutronix.de>

^ permalink raw reply	[flat|nested] 145+ messages in thread
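
[Editor's note: a sketch of what the split enables.  The bitmap field
name below is an assumption for illustration; the patch hunks are not
quoted above.]

	static inline int init_new_context(struct task_struct *tsk,
					   struct mm_struct *mm)
	{
		/* pkey 0 is the default and must start out allocated: */
		mm->context.pkey_allocation_map = 0x1;

		return init_new_context_ldt(tsk, mm);
	}

	static inline void destroy_context(struct mm_struct *mm)
	{
		destroy_context_ldt(mm);
	}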

* Re: [PATCH 30/34] x86, fpu: allow setting of XSAVE state
  2015-12-04  1:15   ` Dave Hansen
@ 2015-12-08 18:48     ` Thomas Gleixner
  -1 siblings, 0 replies; 145+ messages in thread
From: Thomas Gleixner @ 2015-12-08 18:48 UTC (permalink / raw)
  To: Dave Hansen; +Cc: linux-kernel, linux-mm, x86, dave.hansen

On Thu, 3 Dec 2015, Dave Hansen wrote:

> 
> From: Dave Hansen <dave.hansen@linux.intel.com>
> 
> We want to modify the Protection Key rights inside the kernel, so
> we need to change PKRU's contents.  But, if we do a plain
> 'wrpkru', when we return to userspace we might do an XRSTOR and
> wipe out the kernel's 'wrpkru'.  So, we need to go after PKRU in
> the xsave buffer.
> 
> We do this by:
> 1. Ensuring that we have the XSAVE registers (fpregs) in the
>    kernel FPU buffer (fpstate)
> 2. Looking up the location of a given state in the buffer
> 3. Filling in the state
> 4. Ensuring that the hardware knows that state is present there
>    (basically that the 'init optimization' is not in place).
> 5. Copying the newly-modified state back to the registers if
>    necessary.
> 
> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>

Reviewed-by: Thomas Gleixner <tglx@linutronix.de>

^ permalink raw reply	[flat|nested] 145+ messages in thread
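
[Editor's note: a condensed sketch of the five steps, for a PKRU
write.  The helper names (fpu__save, get_xsave_addr,
copy_kernel_to_fpregs) are assumptions based on kernels of this era;
the real patch is more careful about preemption, per the v5 cover
letter.]

	struct fpu *fpu = &current->thread.fpu;
	u32 *pkru_ptr;

	fpu__save(fpu);					/* step 1 */
	pkru_ptr = get_xsave_addr(&fpu->state.xsave,
				  XFEATURE_MASK_PKRU);	/* step 2 */
	*pkru_ptr = new_pkru_value;			/* step 3 */
	fpu->state.xsave.header.xfeatures |= XFEATURE_MASK_PKRU; /* step 4 */
	copy_kernel_to_fpregs(&fpu->state);		/* step 5 */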

* Re: [PATCH 28/34] x86: wire up mprotect_key() system call
  2015-12-08 18:44     ` Thomas Gleixner
@ 2015-12-08 19:06       ` Dave Hansen
  -1 siblings, 0 replies; 145+ messages in thread
From: Dave Hansen @ 2015-12-08 19:06 UTC (permalink / raw)
  To: Thomas Gleixner; +Cc: linux-kernel, linux-mm, x86, dave.hansen, linux-api

On 12/08/2015 10:44 AM, Thomas Gleixner wrote:
> On Thu, 3 Dec 2015, Dave Hansen wrote:
>>  #include <asm-generic/mman.h>
>> diff -puN mm/Kconfig~pkeys-16-x86-mprotect_key mm/Kconfig
>> --- a/mm/Kconfig~pkeys-16-x86-mprotect_key	2015-12-03 16:21:31.114920208 -0800
>> +++ b/mm/Kconfig	2015-12-03 16:21:31.119920435 -0800
>> @@ -679,4 +679,5 @@ config NR_PROTECTION_KEYS
>>  	# Everything supports a _single_ key, so allow folks to
>>  	# at least call APIs that take keys, but require that the
>>  	# key be 0.
>> +	default 16 if X86_INTEL_MEMORY_PROTECTION_KEYS
>>  	default 1
> 
> What happens if I set that to 42?
> 
> I think we want to make this a runtime evaluated thingy. If pkeys are
> compiled in, but the machine does not support it then we don't support
> 16 keys, or do we?

We do have runtime evaluation:

#define arch_max_pkey() (boot_cpu_has(X86_FEATURE_OSPKE) ?      \
                             CONFIG_NR_PROTECTION_KEYS : 1)

The config option really just sets the architectural limit for how many
are supported.  So it probably needs a better name at least.  Let me
take a look at getting rid of this config option entirely.

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH 28/34] x86: wire up mprotect_key() system call
  2015-12-08 19:06       ` Dave Hansen
  (?)
@ 2015-12-08 20:38         ` Thomas Gleixner
  -1 siblings, 0 replies; 145+ messages in thread
From: Thomas Gleixner @ 2015-12-08 20:38 UTC (permalink / raw)
  To: Dave Hansen; +Cc: linux-kernel, linux-mm, x86, dave.hansen, linux-api

On Tue, 8 Dec 2015, Dave Hansen wrote:
> On 12/08/2015 10:44 AM, Thomas Gleixner wrote:
> > On Thu, 3 Dec 2015, Dave Hansen wrote:
> >>  #include <asm-generic/mman.h>
> >> diff -puN mm/Kconfig~pkeys-16-x86-mprotect_key mm/Kconfig
> >> --- a/mm/Kconfig~pkeys-16-x86-mprotect_key	2015-12-03 16:21:31.114920208 -0800
> >> +++ b/mm/Kconfig	2015-12-03 16:21:31.119920435 -0800
> >> @@ -679,4 +679,5 @@ config NR_PROTECTION_KEYS
> >>  	# Everything supports a _single_ key, so allow folks to
> >>  	# at least call APIs that take keys, but require that the
> >>  	# key be 0.
> >> +	default 16 if X86_INTEL_MEMORY_PROTECTION_KEYS
> >>  	default 1
> > 
> > What happens if I set that to 42?
> > 
> > I think we want to make this a runtime evaluated thingy. If pkeys are
> > compiled in, but the machine does not support it then we don't support
> > 16 keys, or do we?
> 
> We do have runtime evaluation:
> 
> #define arch_max_pkey() (boot_cpu_has(X86_FEATURE_OSPKE) ?      \
>                              CONFIG_NR_PROTECTION_KEYS : 1)
> 
> The config option really just sets the architectural limit for how many
> are supported.  So it probably needs a better name at least.  Let me
> take a look at getting rid of this config option entirely.

Well, it does not set the architectural limit. It sets some random
value which the guy who configures the kernel chooses.

The limit we have in the architecture is 16 because we only have 4
bits for it.
 
arch_max_pkey() is architecture specific, so we can make this:

#define arch_max_pkey() (boot_cpu_has(X86_FEATURE_OSPKE) ? 16 : 1)

And when we magically get more bits in the next century, then '16' can
become a variable or whatever.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 145+ messages in thread
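
[Editor's note: with the limit hard-coded as suggested, the validation
helper from patch 25 reduces to a range check.  A sketch under that
assumption:]

	#define arch_max_pkey() (boot_cpu_has(X86_FEATURE_OSPKE) ? 16 : 1)

	static inline bool arch_validate_pkey(int pkey)
	{
		/* pkey 0 always exists; 1-15 only with OSPKE enabled: */
		return pkey >= 0 && pkey < arch_max_pkey();
	}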

* Re: [PATCH 26/34] mm: implement new mprotect_key() system call
  2015-12-07 16:44       ` Dave Hansen
@ 2015-12-09 11:08         ` Michael Kerrisk (man-pages)
  -1 siblings, 0 replies; 145+ messages in thread
From: Michael Kerrisk (man-pages) @ 2015-12-09 11:08 UTC (permalink / raw)
  To: Dave Hansen, linux-kernel
  Cc: mtk.manpages, linux-mm, x86, dave.hansen, linux-api

Hi Dave,

On 7 December 2015 at 17:44, Dave Hansen <dave@sr71.net> wrote:
> On 12/04/2015 10:50 PM, Michael Kerrisk (man-pages) wrote:
>> On 12/04/2015 02:15 AM, Dave Hansen wrote:
>>> From: Dave Hansen <dave.hansen@linux.intel.com>
>>>
>>> mprotect_key() is just like mprotect, except it also takes a
>>> protection key as an argument.  On systems that do not support
>>> protection keys, it still works, but requires that key=0.
>>> Otherwise it does exactly what mprotect does.
>>
>> Is there a man page for this API?
>
> Yep.

Thanks!

> Patch to man-pages source is attached.

Better as inline, for review purposes.

> I actually broke it up in
> to a few separate pages.

Seems the right approach to me.

> I was planning on submitting these after the
> patches themselves go upstream.

Not a good idea. Reading and creating man pages has helped 
me (and others) find a heap of design and implementation
bugs in APIs. Best that that happens before things hit 
upstream.

Would you be willing to revise your man page (and possibly 
your kernel patches) in the light of my comments below?
It would be better to do this sooner than later, since 
I suspect I'll have a few more API comments as I review 
future drafts of the page.

> commit ebb12643876810931ed23992f92b7c77c2c36883
> Author: Dave Hansen <dave.hansen@intel.com>
> Date:   Mon Dec 7 08:42:57 2015 -0800
>
>     pkeys
>
> diff --git a/man2/mprotect.2 b/man2/mprotect.2
> index ae305f6..a3c1e62 100644
> --- a/man2/mprotect.2
> +++ b/man2/mprotect.2
> @@ -38,16 +38,19 @@
>  .\"
>  .TH MPROTECT 2 2015-07-23 "Linux" "Linux Programmer's Manual"
>  .SH NAME
> -mprotect \- set protection on a region of memory
> +mprotect, mprotect_key \- set protection on a region of memory

Elsewhere in your patch series (in a mail with subject 
"mm: implement new mprotect_key() system call") I see:

+SYSCALL_DEFINE4(pkey_mprotect, unsigned long, start, size_t, len,
+               unsigned long, prot, int, pkey)
+{
+       if (!arch_validate_pkey(pkey))
+               return -EINVAL;
+
+       return do_mprotect_pkey(start, len, prot, pkey);
+}

And lower down in this patch series, I see "mprotect_pkey"!

What is the name of this system call supposed to be?

For what it's worth, I think "mprotect_pkey()" is the best 
name (and secretly, you seem to as well, since we have at 
the bottom of it all the internal function "do_mprotect_pkey()". 
It signifies that this is a modified version of the base 
functionality provided by mprotect(), and "pkey" is 
consistent with the remainder of the APIs.

But, whatever name you do choose, please fix it in all 
of your commit messages, otherwise reading the git 
history gets very confusing.

>  .SH SYNOPSIS
>  .nf
>  .B #include <sys/mman.h>
>  .sp
>  .BI "int mprotect(void *" addr ", size_t " len ", int " prot );
> +.BI "int mprotect_key(void *" addr ", size_t " len ", int " prot , " int " key);
>  .fi
>  .SH DESCRIPTION
>  .BR mprotect ()
> -changes protection for the calling process's memory page(s)
> +and
> +.BR mprotect_key ()
> +change protection for the calling process's memory page(s)
>  containing any part of the address range in the
>  interval [\fIaddr\fP,\ \fIaddr\fP+\fIlen\fP\-1].
>  .I addr
> @@ -74,10 +77,17 @@ The memory can be modified.
>  .TP
>  .B PROT_EXEC
>  The memory can be executed.
> +.PP
> +.I key
> +is the protection or storage key to assign to the memory.

Why "protection or storage key" here? This phrasing seems a
little ambiguous to me, given that we also have a 'prot'
argument.  I think it would be clearer just to say 
"protection key". But maybe I'm missing something.

> +A key must be allocated with pkey_alloc () before it is

Please format syscall cross references as

.BR pkey_alloc (2)

> +passed to pkey_mprotect ().
>  .SH RETURN VALUE
>  On success,
>  .BR mprotect ()
> -returns zero.
> +and
> +.BR mprotect_key ()
> +return zero.
>  On error, \-1 is returned, and
>  .I errno
>  is set appropriately.

Are there no errors specific to mprotect_key()? Is there
an error if pkey is invalid? I see now that there is. That
EINVAL error needs documenting.

> diff --git a/man2/pkey_alloc.2 b/man2/pkey_alloc.2
> new file mode 100644
> index 0000000..980ce3e
> --- /dev/null
> +++ b/man2/pkey_alloc.2
> @@ -0,0 +1,72 @@
> +.\" Copyright (C) 2007 Michael Kerrisk <mtk.manpages@gmail.com>
> +.\" and Copyright (C) 1995 Michael Shields <shields@tembel.org>.

Michaels have many talents, but  documenting kernel APIs 
20 years ahead of their creation is not one of them, I believe. 
Better replace this with the actual copyright holder and author
name.

> +.\" %%%LICENSE_START(VERBATIM)
> +.\" Permission is granted to make and distribute verbatim copies of this
> +.\" manual provided the copyright notice and this permission notice are
> +.\" preserved on all copies.
> +.\"
> +.\" Permission is granted to copy and distribute modified versions of this
> +.\" manual under the conditions for verbatim copying, provided that the
> +.\" entire resulting derived work is distributed under the terms of a
> +.\" permission notice identical to this one.
> +.\"
> +.\" Since the Linux kernel and libraries are constantly changing, this
> +.\" manual page may be incorrect or out-of-date.  The author(s) assume no
> +.\" responsibility for errors or omissions, or for damages resulting from
> +.\" the use of the information contained herein.  The author(s) may not
> +.\" have taken the same level of care in the production of this manual,
> +.\" which is licensed free of charge, as they might when working
> +.\" professionally.
> +.\"
> +.\" Formatted or processed versions of this manual, if unaccompanied by
> +.\" the source, must acknowledge the copyright and author of this work.
> +.\" %%%LICENSE_END
> +.\"
> +.\" Modified 2015-12-04 by Dave Hansen <dave@sr71.net>

This info should be in the copyright notice above.

> +.\"
> +.\"
> +.TH PKEY_ALLOC 2 2015-12-04 "Linux" "Linux Programmer's Manual"
> +.SH NAME
> +pkey_alloc, pkey_free \- allocate or free a protection key
> +.SH SYNOPSIS
> +.nf
> +.B #include <sys/mman.h>
> +.sp
> +.BI "int pkey_alloc(unsigned long" flags ", unsigned long " init_val);

If I understand correctly, 'init_val' is a mask of access rights 
as per pkey_set(). If so, let's name the argument in this man page 
"access_rights" also (or perhaps "init_access_rights" if you must, 
but I think the shorter name is better. This helps the reader 
to understand that we're talking about the same thing. It would be 
good also to make the same change in the kernel code.

> +.BI "int pkey_free(int " pkey);
> +.fi
> +.SH DESCRIPTION
> +.BR pkey_alloc ()
> +and
> +.BR pkey_free ()
> +allow or disallow the calling process's to use the given

s/process's/process/
But should this actually be "thread"?

> +protection key for all protection-key-related operations.
> +
> +.PP
> +.I flags
> +is may contain zero or more disable operation:

s/is may/may/

> +.B PKEY_DISABLE_ACCESS
> +and/or
> +.B PKEY_DISABLE_WRITE

For the above two, please format as
.TP
.B PKEY_DISABLE_ACCESS
<Explanation of this flag>
.TP
.B PKEY_DISABLE_WRITE
<Explanation of this flag>

> +.SH RETURN VALUE
> +On success,
> +.BR pkey_alloc ()
> +and
> +.BR pkey_free ()
> +return zero.

The description of the success return for pkey_alloc() can't 
be right. Doesn't it return a protection key?

> +On error, \-1 is returned, and
> +.I errno
> +is set appropriately.
> +.SH ERRORS
> +.TP
> +.B EINVAL
> +An invalid protection key, flag, or init_val was specified.

Better to write that last line as:

[[
.IR pkey ,
.IR flags ,
or
.I init_val                    [Or: access_rights]
is invalid.
]]

> +.TP
> +.B ENOSPC
> +All protection keys available for the current process have
> +been allocated.

So it seems to me that this page needs a discussion of the
limit that is involved here.

> +.SH SEE ALSO
> +.BR mprotect_pkey (2),
> +.BR pkey_get (2),
> +.BR pkey_set (2),

Remove trailing comma.

> diff --git a/man2/pkey_get.2 b/man2/pkey_get.2
> new file mode 100644
> index 0000000..4cfdea9
> --- /dev/null
> +++ b/man2/pkey_get.2
> @@ -0,0 +1,76 @@
> +.\" Copyright (C) 2007 Michael Kerrisk <mtk.manpages@gmail.com>
> +.\" and Copyright (C) 1995 Michael Shields <shields@tembel.org>.

Again, the copyright notice needs fixing.

> +.\" %%%LICENSE_START(VERBATIM)
> +.\" Permission is granted to make and distribute verbatim copies of this
> +.\" manual provided the copyright notice and this permission notice are
> +.\" preserved on all copies.
> +.\"
> +.\" Permission is granted to copy and distribute modified versions of this
> +.\" manual under the conditions for verbatim copying, provided that the
> +.\" entire resulting derived work is distributed under the terms of a
> +.\" permission notice identical to this one.
> +.\"
> +.\" Since the Linux kernel and libraries are constantly changing, this
> +.\" manual page may be incorrect or out-of-date.  The author(s) assume no
> +.\" responsibility for errors or omissions, or for damages resulting from
> +.\" the use of the information contained herein.  The author(s) may not
> +.\" have taken the same level of care in the production of this manual,
> +.\" which is licensed free of charge, as they might when working
> +.\" professionally.
> +.\"
> +.\" Formatted or processed versions of this manual, if unaccompanied by
> +.\" the source, must acknowledge the copyright and author of this work.
> +.\" %%%LICENSE_END
> +.\"
> +.\" Modified 2015-12-04 by Dave Hansen <dave@sr71.net>
> +.\"
> +.\"
> +.TH PKEY_GET 2 2015-12-04 "Linux" "Linux Programmer's Manual"
> +.SH NAME
> +pkey_get, pkey_set \- manage protection key access permissions
> +.SH SYNOPSIS
> +.nf
> +.B #include <sys/mman.h>
> +.sp
> +.BI "int pkey_get(int " pkey);
> +.BI "int pkey_set(int " pkey ", unsigned long " access_rights);
> +.fi
> +.SH DESCRIPTION
> +.BR pkey_get ()
> +and
> +.BR pkey_set ()
> +query or set the current set of rights for the calling
> +task for the given protection key.

Change "task" to "thread".

> +When rights for a key are disabled, any future access
> +to any memory region with that key set will generate
> +a SIGSEGV.  The rights are local to the calling thread and
> +do not affect any other threads.

I think the last sentence could be simpler ("Access rights are 
private to each thread."), or even removed, since you already 
say above that these operations are per task (should be "per thread").

> +.PP
> +Upon entering any signal handler, the process is given a
> +default set of protection key rights which are separate from
> +the main thread's.  Any calls to pkey_set () in a signal

s/signal/signal handler/

Format the reference to this system call as:

.BR pkey_set ()

> +will not persist upon a return to the calling process.

So, the preceding paragraph leaves me confused. And I'm wondering 
if that confusion reflects some weirdness in the API design. But
I can't tell until I understand it better. These are my problems:

* You throw "process" and "thread" together in the explanation. 
  Is this simply a mistake? If it is not, the distinction 
  you are trying to draw by using the two different terms 
  is not made clear in the text.

* Your text ("separate from the main thread's") makes it
  sound as though a signal handler is somehow invoked in a
  different thread, which makes no sense. I suspect you
  want to say something like this:

  [[
  When a signal handler is invoked, the thread is temporarily
  given a default set of protection key rights. The thread's 
  protection key rights are restored when the signal handler 
  returns.
  ]]

  Is that close to the truth?

* Change "a return to the calling process" to "when the
  signal handler returns". Signal handlers are not "called"
  by the program.

* There needs to be some explanation in this page of *why*
  this special behavior occurs when signal handlers are
  invoked.

And I have a question (and the answer probably should 
be documented in the manual page).  What happens when 
one signal handler interrupts the execution of another? 
Do pkey_set() calls in the first handler persist into the 
second handler? I presume not, but it would be good to 
be a little more explicit about this.

> +.PP
> +.I access_rights
> +is may contain zero or more disable operation:

s/is may/may/

> +.B PKEY_DISABLE_ACCESS
> +and/or
> +.B PKEY_DISABLE_WRITE

For the above two, please format as

[[
.TP
.B PKEY_DISABLE_ACCESS
<Explanation of this flag>
.TP
.B PKEY_DISABLE_WRITE
<Explanation of this flag>
]]

In various commit messages you use two alternative names:
PKEY_DENY_ACCESS and PKEY_DENY_WRITE. I assume bit rot here 
as the API has evolved. But please fix all of those
commit messages, so that the git history is more sensible.

> +.SH RETURN VALUE
> +On success,
> +.BR pkey_get ()
> +and
> +.BR pkey_set ()
> +return zero.

The success return value of pkey_get() is not correct.
Doesn't it return an access rights mask?

> +On error, \-1 is returned, and
> +.I errno
> +is set appropriately.
> +.SH ERRORS
> +.TP
> +.B EINVAL
> +An invalid protection key or access_rights was specified.

[[
.I access_rights
or
.I pkey
is invalid.
]]

> +.SH SEE ALSO
> +.BR mprotect_pkey (2),
> +.BR pkey_alloc (2),
> +.BR pkey_free (2),

Remove trailing comma.

So at the end of reading these pages, and delving
a little through the commit messages, I still don't
feel convinced that I understand what these APIs are
about. There's several things that I think still need 
to be added to these pages:

* A general overview of why this functionality is useful.
* A note on which architectures support/will support
  this functionality.
* Explanation of what a protection domain is.
* Explanation of how a process (thread?) changes its
  protection domain.
* Explanation of the relationship between page permission
  bits (PROT_READ/PROT_WRITE/PROT_EXEC) and 
  PKEY_DISABLE_ACCESS and PKEY_DISABLE_WRITE.
  It's still not clear to me. Do the PKEY_* bits
  override the PROT_* bits. Or, something else?

Thanks,

Michael


-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

^ permalink raw reply	[flat|nested] 145+ messages in thread
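
[Editor's note: to make the API under review concrete, a userspace
sketch using the v5 names discussed in this thread.  pkey_set() was
later dropped in favor of updating PKRU directly from userspace, so
treat this as the draft API, not the final one; error handling
omitted.]

	#include <sys/mman.h>

	int pkey = pkey_alloc(0, PKEY_DISABLE_WRITE);
	char *buf = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
			 MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);

	/* attach the key; the page permissions stay PROT_READ|PROT_WRITE */
	pkey_mprotect(buf, 4096, PROT_READ | PROT_WRITE, pkey);
	/* ...but a write from this thread would now fault with SIGSEGV */

	pkey_set(pkey, 0);	/* re-enable write access, this thread only */
	buf[0] = 1;		/* fine now */
	pkey_free(pkey);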

* Re: [PATCH 26/34] mm: implement new mprotect_key() system call
  2015-12-09 11:08         ` Michael Kerrisk (man-pages)
@ 2015-12-09 15:48           ` Dave Hansen
  -1 siblings, 0 replies; 145+ messages in thread
From: Dave Hansen @ 2015-12-09 15:48 UTC (permalink / raw)
  To: Michael Kerrisk (man-pages), linux-kernel
  Cc: linux-mm, x86, dave.hansen, linux-api

Hi Michael,

Thanks for all the comments!  I'll fix most of it when I post a new
version of the manpage, but I have a few general questions.

On 12/09/2015 03:08 AM, Michael Kerrisk (man-pages) wrote:
>>
>> +is the protection or storage key to assign to the memory.
> 
> Why "protection or storage key" here? This phrasing seems a
> little ambiguous to me, given that we also have a 'prot'
> argument.  I think it would be clearer just to say 
> "protection key". But maybe I'm missing something.

x86 calls it a "protection key" while powerpc calls it a "storage key".
 They're called "protection keys" consistently inside the kernel.

Should we just stick to one name in the manpages?

> * A general overview of why this functionality is useful.

Any preference on a central spot to do the general overview?  Does it go
in one of the manpages I'm already modifying, or a new one?

> * A note on which architectures support/will support
>   this functionality.

x86 only for now.  We might get powerpc support down the road somewhere.

> * Explanation of what a protection domain is.

A protection domain is a unique view of memory and is represented by the
value in the PKRU register.

> * Explanation of how a process (thread?) changes its
>   protection domain.

Changing protection domains is done with the pkey_set() system call, or
by using the WRPKRU instruction directly.  The system call is preferred
and less error-prone since it enforces that a protection key is
allocated before its access rights can be modified.
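
To make that concrete, here is a rough sketch of the intended flow from
userspace.  It assumes the prototypes from the draft man pages in this
thread; the PKEY_DISABLE_WRITE value is an assumption, and no libc
wrappers exist yet, so the declarations are illustrative rather than a
final interface:

#include <sys/mman.h>           /* PROT_READ, PROT_WRITE */
#include <stddef.h>

#define PKEY_DISABLE_WRITE 0x2  /* assumed value, not final */

/* Prototypes as drafted in this thread; no libc wrappers yet. */
int pkey_alloc(unsigned long flags, unsigned long init_access_rights);
int pkey_set(int pkey, unsigned long access_rights);
int mprotect_pkey(void *addr, size_t len, int prot, int pkey);

static void write_protect_with_pkey(void *buf, size_t len)
{
        /* Allocate a key whose rights start out write-disabled. */
        int pkey = pkey_alloc(0, PKEY_DISABLE_WRITE);

        /* Tag the pages; the page permissions themselves stay RW. */
        mprotect_pkey(buf, len, PROT_READ | PROT_WRITE, pkey);

        /* Writes to buf now fault in this thread ... */

        /* ... until we change this thread's protection domain. */
        pkey_set(pkey, 0);
}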

> * Explanation of the relationship between page permission
>   bits (PROT_READ/PROT_WRITE/PROTE_EXEC) and 
>   PKEY_DISABLE_ACCESS and PKEY_DISABLE_WRITE.
>   It's still not clear to me. Do the PKEY_* bits
>   override the PROT_* bits. Or, something else?

Protection keys add access restrictions in addition to existing page
permissions.  They can only take away access; they never grant
additional access.
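
In other words, the effective permission is the intersection of the
two: an access must be allowed by both the page's PROT_* bits and the
thread's rights for the page's key.  A sketch of that rule, reusing the
assumed declarations from the example above (again, the PKEY_* value is
an assumption):

#include <sys/mman.h>

#define PKEY_DISABLE_ACCESS 0x1  /* assumed value, not final */

int pkey_alloc(unsigned long flags, unsigned long init_access_rights);
int pkey_set(int pkey, unsigned long access_rights);
int mprotect_pkey(void *addr, size_t len, int prot, int pkey);

static void intersection_demo(void)
{
        char *p = mmap(NULL, 4096, PROT_READ,   /* no PROT_WRITE */
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        int pkey = pkey_alloc(0, 0);            /* all rights enabled */

        mprotect_pkey(p, 4096, PROT_READ, pkey);

        (void)p[0];  /* OK: PROT_READ set and the key allows access  */
        /* p[0] = 1;    would SIGSEGV: a key cannot add PROT_WRITE   */

        pkey_set(pkey, PKEY_DISABLE_ACCESS);
        /* (void)p[0];  would now SIGSEGV: reads disabled by the key */
}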


^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH 26/34] mm: implement new mprotect_key() system call
  2015-12-09 15:48           ` Dave Hansen
  (?)
@ 2015-12-09 16:45             ` Michael Kerrisk (man-pages)
  -1 siblings, 0 replies; 145+ messages in thread
From: Michael Kerrisk (man-pages) @ 2015-12-09 16:45 UTC (permalink / raw)
  To: Dave Hansen
  Cc: lkml, linux-mm@kvack.org, x86@kernel.org, dave.hansen, Linux API

Hi Dave,

On 9 December 2015 at 16:48, Dave Hansen <dave@sr71.net> wrote:
> Hi Michael,
>
> Thanks for all the comments!  I'll fix most of it when I post a new
> version of the manpage, but I have a few general questions.
>
> On 12/09/2015 03:08 AM, Michael Kerrisk (man-pages) wrote:
>>>
>>> +is the protection or storage key to assign to the memory.
>>
>> Why "protection or storage key" here? This phrasing seems a
>> little ambiguous to me, given that we also have a 'prot'
>> argument.  I think it would be clearer just to say
>> "protection key". But maybe I'm missing something.
>
> x86 calls it a "protection key" while powerpc calls it a "storage key".
>  They're called "protection keys" consistently inside the kernel.
>
> Should we just stick to one name in the manpages?

Yes. But perhaps you could note the alternate name in the pkey(7) page.

>> * A general overview of why this functionality is useful.
>
> Any preference on a central spot to do the general overview?  Does it go
> in one of the manpages I'm already modifying, or a new one?

How about we add one more page, pkey(7), that gives the overview and
also summarizes the APIs?

>> * A note on which architectures support/will support
>>   this functionality.
>
> x86 only for now.  We might get powerpc support down the road somewhere.

Supported architectures can be listed in pkey(7).

>> * Explanation of what a protection domain is.
>
> A protection domain is a unique view of memory and is represented by the
> value in the PKRU register.

Put something about this in pkey(7), but explain what you mean by a
"unique view of memory".

>> * Explanation of how a process (thread?) changes its
>>   protection domain.
>
> Changing protection domains is done by pkey_set() system call, or by
> using the WRPKRU instruction.  The system call is preferred and less
> error-prone since it enforces that a protection is allocated before its
> access protection can be modified.

Details (perhaps not the WRPKRU bit) that should go in pkey(7).

>> * Explanation of the relationship between page permission
>>   bits (PROT_READ/PROT_WRITE/PROTE_EXEC) and
>>   PKEY_DISABLE_ACCESS and PKEY_DISABLE_WRITE.
>>   It's still not clear to me. Do the PKEY_* bits
>>   override the PROT_* bits. Or, something else?
>
> Protection keys add access restrictions in addition to existing page
> permissions.  They can only take away access; they never grant
> additional access.

This belongs in pkey(7) :-).

Cheers,

Michael

-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH 26/34] mm: implement new mprotect_key() system call
  2015-12-09 16:45             ` Michael Kerrisk (man-pages)
  (?)
@ 2015-12-09 17:05               ` Dave Hansen
  -1 siblings, 0 replies; 145+ messages in thread
From: Dave Hansen @ 2015-12-09 17:05 UTC (permalink / raw)
  To: mtk.manpages
  Cc: lkml, linux-mm@kvack.org, x86@kernel.org, dave.hansen, Linux API

On 12/09/2015 08:45 AM, Michael Kerrisk (man-pages) wrote:
>>> >> * Explanation of what a protection domain is.
>> >
>> > A protection domain is a unique view of memory and is represented by the
>> > value in the PKRU register.
> Out something about this in pkey(7), but explain what you mean by a
> "unique view of memory".

Let's say there are only two protection keys: 0 and 1.  There are two
disable bits per protection key (Access and Write Disable), so a two-key
PKRU looks like:

|   PKEY0   |   PKEY1   |
| AD0 | WD0 | AD1 | WD1 |

In this example, there are 16 possible protection domains, one for each
possible combination of the 4 rights-disable bits.

"Changing a protection domain" would mean changing (setting or clearing)
the value of any of those 4 bits.  Each unique value of PKRU represents
a view of memory, or unique protection domain.
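
A sketch of the bit arithmetic that layout implies (the macro names are
made up here for illustration; two bits per key, access-disable in the
low bit, write-disable in the high bit):

#include <stdbool.h>

#define PKRU_BITS_PER_PKEY 2
#define PKRU_AD_BIT(pkey) (1u << ((pkey) * PKRU_BITS_PER_PKEY))     /* access disable */
#define PKRU_WD_BIT(pkey) (1u << ((pkey) * PKRU_BITS_PER_PKEY + 1)) /* write disable  */

static bool pkru_allows_read(unsigned int pkru, int pkey)
{
        return !(pkru & PKRU_AD_BIT(pkey));
}

static bool pkru_allows_write(unsigned int pkru, int pkey)
{
        /* Access-disable forbids writes too, so check both bits. */
        return !(pkru & (PKRU_AD_BIT(pkey) | PKRU_WD_BIT(pkey)));
}

With 16 keys the same arithmetic covers all 32 bits of PKRU.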

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH 26/34] mm: implement new mprotect_key() system call
  2015-12-09 17:05               ` Dave Hansen
@ 2015-12-11 20:13                 ` Michael Kerrisk (man-pages)
  -1 siblings, 0 replies; 145+ messages in thread
From: Michael Kerrisk (man-pages) @ 2015-12-11 20:13 UTC (permalink / raw)
  To: Dave Hansen
  Cc: mtk.manpages, lkml, linux-mm@kvack.org, x86@kernel.org,
	dave.hansen, Linux API

On 12/09/2015 06:05 PM, Dave Hansen wrote:
> On 12/09/2015 08:45 AM, Michael Kerrisk (man-pages) wrote:
>>>>>> * Explanation of what a protection domain is.
>>>>
>>>> A protection domain is a unique view of memory and is represented by the
>>>> value in the PKRU register.
>> Out something about this in pkey(7), but explain what you mean by a
>> "unique view of memory".
> 
> Let's say there are only two protection keys: 0 and 1.  There are two
> disable bits per protection key (Access and Write Disable), so a two-key
> PKRU looks like:
> 
> |   PKEY0   |   PKEY1   |
> | AD0 | WD0 | AD1 | WD1 |
> 
> In this example, there are 16 possible protection domains, one for each
> possible combination of the 4 rights-disable bits.
> 
> "Changing a protection domain" would mean changing (setting or clearing)
> the value of any of those 4 bits.  Each unique value of PKRU represents
> a view of memory, or unique protection domain.

Again, some of this could make its way into pkey(7). And I guess there
are useful nuggets for that page to be found in Jon's article at
https://lwn.net/Articles/667156/

Thanks,

Michael

-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH 00/34] x86: Memory Protection Keys (v5)
  2015-12-04 23:38     ` Dave Hansen
  (?)
@ 2015-12-11 20:16       ` Andy Lutomirski
  -1 siblings, 0 replies; 145+ messages in thread
From: Andy Lutomirski @ 2015-12-11 20:16 UTC (permalink / raw)
  To: Dave Hansen
  Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, X86 ML,
	Linux API, linux-arch, Andrea Arcangeli, Andrew Morton, Jan Kara,
	Kirill A. Shutemov, Naoya Horiguchi

On Fri, Dec 4, 2015 at 3:38 PM, Dave Hansen <dave@sr71.net> wrote:
> On 12/04/2015 03:31 PM, Andy Lutomirski wrote:
>> On Thu, Dec 3, 2015 at 5:14 PM, Dave Hansen <dave@sr71.net> wrote:
>>> Memory Protection Keys for User pages is a CPU feature which will
>>> first appear on Skylake Servers, but will also be supported on
>>> future non-server parts.  It provides a mechanism for enforcing
>>> page-based protections, but without requiring modification of the
>>> page tables when an application changes protection domains.  See
>>> the Documentation/ patch for more details.
>>
>> What, if anything, happened to the signal handling parts?
>
> Patches 12 and 13 contain most of it:
>
>         x86, pkeys: fill in pkey field in siginfo
>         signals, pkeys: notify userspace about protection key faults
>
> I decided to just not try to preserve the pkey_get/set() semantics
> across entering and returning from signals, fwiw.

Hmm.  I'll see if I can find some time this weekend to play with that
bit.  Maybe I can test by faking it and tweaking MPX instead of PKRU.

--Andy

^ permalink raw reply	[flat|nested] 145+ messages in thread

end of thread

Thread overview: 145+ messages
2015-12-04  1:14 [PATCH 00/34] x86: Memory Protection Keys (v5) Dave Hansen
2015-12-04  1:14 ` [PATCH 01/34] mm, gup: introduce concept of "foreign" get_user_pages() Dave Hansen
2015-12-04  1:14 ` [PATCH 02/34] x86, fpu: add placeholder for Processor Trace XSAVE state Dave Hansen
2015-12-04  1:14 ` [PATCH 03/34] x86, pkeys: Add Kconfig option Dave Hansen
2015-12-04  1:14 ` [PATCH 04/34] x86, pkeys: cpuid bit definition Dave Hansen
2015-12-04  1:14 ` [PATCH 05/34] x86, pkeys: define new CR4 bit Dave Hansen
2015-12-04  1:14 ` [PATCH 06/34] x86, pkeys: add PKRU xsave fields and data structure(s) Dave Hansen
2015-12-04  1:14 ` [PATCH 07/34] x86, pkeys: PTE bits for storing protection key Dave Hansen
2015-12-04  1:14 ` [PATCH 08/34] x86, pkeys: new page fault error code bit: PF_PK Dave Hansen
2015-12-04  1:14 ` [PATCH 09/34] x86, pkeys: store protection in high VMA flags Dave Hansen
2015-12-08 14:17   ` Thomas Gleixner
2015-12-04  1:14 ` [PATCH 10/34] x86, pkeys: arch-specific protection bits Dave Hansen
2015-12-08 15:15   ` [PATCH 10/34] x86, pkeys: arch-specific protection bitsy Thomas Gleixner
2015-12-08 16:34     ` Dave Hansen
2015-12-08 17:24       ` Thomas Gleixner
2015-12-08 18:06         ` Dave Hansen
2015-12-08 18:29           ` Thomas Gleixner
2015-12-08 18:35             ` Thomas Gleixner
2015-12-04  1:14 ` [PATCH 11/34] x86, pkeys: pass VMA down in to fault signal generation code Dave Hansen
2015-12-04  1:14 ` [PATCH 12/34] signals, pkeys: notify userspace about protection key faults Dave Hansen
2015-12-04  1:14 ` [PATCH 13/34] x86, pkeys: fill in pkey field in siginfo Dave Hansen
2015-12-04  1:14 ` [PATCH 14/34] x86, pkeys: add functions to fetch PKRU Dave Hansen
2015-12-08 15:18   ` Thomas Gleixner
2015-12-04  1:14 ` [PATCH 15/34] mm: factor out VMA fault permission checking Dave Hansen
2015-12-08 17:26   ` Thomas Gleixner
2015-12-04  1:14 ` [PATCH 16/34] x86, mm: simplify get_user_pages() PTE bit handling Dave Hansen
2015-12-08 18:01   ` Thomas Gleixner
2015-12-08 18:30     ` Dave Hansen
2015-12-04  1:14 ` [PATCH 17/34] x86, pkeys: check VMAs and PTEs for protection keys Dave Hansen
2015-12-08 18:11   ` Thomas Gleixner
2015-12-04  1:14 ` [PATCH 18/34] mm: add gup flag to indicate "foreign" mm access Dave Hansen
2015-12-04  1:14 ` [PATCH 19/34] x86, pkeys: optimize fault handling in access_error() Dave Hansen
2015-12-08 18:14   ` Thomas Gleixner
2015-12-04  1:14 ` [PATCH 20/34] x86, pkeys: differentiate instruction fetches Dave Hansen
2015-12-08 18:17   ` Thomas Gleixner
2015-12-04  1:14 ` [PATCH 21/34] x86, pkeys: dump PKRU with other kernel registers Dave Hansen
2015-12-08 18:19   ` Thomas Gleixner
2015-12-04  1:14 ` [PATCH 22/34] x86, pkeys: dump PTE pkey in /proc/pid/smaps Dave Hansen
2015-12-08 18:20   ` Thomas Gleixner
2015-12-04  1:14 ` [PATCH 23/34] x86, pkeys: add Kconfig prompt to existing config option Dave Hansen
2015-12-08 18:21   ` Thomas Gleixner
2015-12-04  1:14 ` [PATCH 24/34] mm, multi-arch: pass a protection key in to calc_vm_flag_bits() Dave Hansen
2015-12-04  1:14 ` [PATCH 25/34] x86, pkeys: add arch_validate_pkey() Dave Hansen
2015-12-08 18:39   ` Thomas Gleixner
2015-12-04  1:15 ` [PATCH 26/34] mm: implement new mprotect_key() system call Dave Hansen
2015-12-05  6:50   ` Michael Kerrisk (man-pages)
2015-12-07 16:44     ` Dave Hansen
2015-12-09 11:08       ` Michael Kerrisk (man-pages)
2015-12-09 15:48         ` Dave Hansen
2015-12-09 16:45           ` Michael Kerrisk (man-pages)
2015-12-09 17:05             ` Dave Hansen
2015-12-11 20:13               ` Michael Kerrisk (man-pages)
2015-12-04  1:15 ` [PATCH 27/34] x86, pkeys: make mprotect_key() mask off additional vm_flags Dave Hansen
2015-12-08 18:41   ` Thomas Gleixner
2015-12-04  1:15 ` [PATCH 28/34] x86: wire up mprotect_key() system call Dave Hansen
2015-12-08 18:44   ` Thomas Gleixner
2015-12-08 19:06     ` Dave Hansen
2015-12-08 20:38       ` Thomas Gleixner
2015-12-04  1:15 ` [PATCH 29/34] x86: separate out LDT init from context init Dave Hansen
2015-12-08 18:45   ` Thomas Gleixner
2015-12-04  1:15 ` [PATCH 30/34] x86, fpu: allow setting of XSAVE state Dave Hansen
2015-12-08 18:48   ` Thomas Gleixner
2015-12-04  1:15 ` [PATCH 31/34] x86, pkeys: allocation/free syscalls Dave Hansen
2015-12-04  1:15 ` [PATCH 32/34] x86, pkeys: add pkey set/get syscalls Dave Hansen
2015-12-04  1:15 ` [PATCH 33/34] x86, pkeys: actually enable Memory Protection Keys in CPU Dave Hansen
2015-12-04  1:15 ` [PATCH 34/34] x86, pkeys: Documentation Dave Hansen
2015-12-04 23:31 ` [PATCH 00/34] x86: Memory Protection Keys (v5) Andy Lutomirski
2015-12-04 23:38   ` Dave Hansen
2015-12-11 20:16     ` Andy Lutomirski
