LKML Archive mirror
 help / color / mirror / Atom feed
* [PATCH RFC 0/3] pagemap: make useable for non-privilege users
@ 2015-05-12  9:43 Konstantin Khlebnikov
  2015-05-12  9:43 ` [PATCH v2 1/3] pagemap: add mmap-exclusive bit for marking pages mapped only here Konstantin Khlebnikov
                   ` (3 more replies)
  0 siblings, 4 replies; 18+ messages in thread
From: Konstantin Khlebnikov @ 2015-05-12  9:43 UTC (permalink / raw
  To: linux-mm, Naoya Horiguchi, linux-kernel, Andrew Morton
  Cc: Mark Williamson, Pavel Emelyanov, linux-api, Andy Lutomirski,
	Vlastimil Babka, Pavel Machek, Mark Seaborn, Kirill A. Shutemov,
	Linus Torvalds, Daniel James, Finn Grimwood

This patchset tries to make pagemap useable again in the safe way.
First patch adds bit 'map-exlusive' which is set if page is mapped only here.
Second patch restores access for non-privileged users but hides pfn if task
has no capability CAP_SYS_ADMIN. Third patch removes page-shift bits and
completes migration to the new pagemap format (flags soft-dirty and
mmap-exlusive are available only in the new format).

---

Konstantin Khlebnikov (3):
      pagemap: add mmap-exclusive bit for marking pages mapped only here
      pagemap: hide physical addresses from non-privileged users
      pagemap: switch to the new format and do some cleanup


 Documentation/vm/pagemap.txt |    3 -
 fs/proc/task_mmu.c           |  178 +++++++++++++++++-------------------------
 tools/vm/page-types.c        |   35 ++++----
 3 files changed, 91 insertions(+), 125 deletions(-)

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH v2 1/3] pagemap: add mmap-exclusive bit for marking pages mapped only here
  2015-05-12  9:43 [PATCH RFC 0/3] pagemap: make useable for non-privilege users Konstantin Khlebnikov
@ 2015-05-12  9:43 ` Konstantin Khlebnikov
  2015-05-12 10:40   ` Kirill A. Shutemov
  2015-05-12 12:05   ` Mark Williamson
  2015-05-12  9:43 ` [PATCH v2 2/3] pagemap: hide physical addresses from non-privileged users Konstantin Khlebnikov
                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 18+ messages in thread
From: Konstantin Khlebnikov @ 2015-05-12  9:43 UTC (permalink / raw
  To: linux-mm, Naoya Horiguchi, linux-kernel, Andrew Morton
  Cc: Mark Williamson, Pavel Emelyanov, linux-api, Andy Lutomirski,
	Vlastimil Babka, Pavel Machek, Mark Seaborn, Kirill A. Shutemov,
	Linus Torvalds, Daniel James, Finn Grimwood

This patch sets bit 56 in pagemap if this page is mapped only once.
It allows to detect exclusively used pages without exposing PFN:

present file exclusive state
0       0    0         non-present
1       1    0         file page mapped somewhere else
1       1    1         file page mapped only here
1       0    0         anon non-CoWed page (shared with parent/child)
1       0    1         anon CoWed page (or never forked)

Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Link: lkml.kernel.org/r/CAEVpBa+_RyACkhODZrRvQLs80iy0sqpdrd0AaP_-tgnX3Y9yNQ@mail.gmail.com

---

v2:
* handle transparent huge pages
* invert bit and rename shared -> exclusive (less confusing name)
---
 Documentation/vm/pagemap.txt |    3 ++-
 fs/proc/task_mmu.c           |   10 ++++++++++
 tools/vm/page-types.c        |   12 ++++++++++++
 3 files changed, 24 insertions(+), 1 deletion(-)

diff --git a/Documentation/vm/pagemap.txt b/Documentation/vm/pagemap.txt
index 6bfbc172cdb9..3cfbbb333ea1 100644
--- a/Documentation/vm/pagemap.txt
+++ b/Documentation/vm/pagemap.txt
@@ -16,7 +16,8 @@ There are three components to pagemap:
     * Bits 0-4   swap type if swapped
     * Bits 5-54  swap offset if swapped
     * Bit  55    pte is soft-dirty (see Documentation/vm/soft-dirty.txt)
-    * Bits 56-60 zero
+    * Bit  56    page exlusively mapped
+    * Bits 57-60 zero
     * Bit  61    page is file-page or shared-anon
     * Bit  62    page swapped
     * Bit  63    page present
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 6dee68d013ff..29febec65de4 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -982,6 +982,7 @@ struct pagemapread {
 #define PM_STATUS2(v2, x)   (__PM_PSHIFT(v2 ? x : PAGE_SHIFT))
 
 #define __PM_SOFT_DIRTY      (1LL)
+#define __PM_MMAP_EXCLUSIVE  (2LL)
 #define PM_PRESENT          PM_STATUS(4LL)
 #define PM_SWAP             PM_STATUS(2LL)
 #define PM_FILE             PM_STATUS(1LL)
@@ -1074,6 +1075,8 @@ static void pte_to_pagemap_entry(pagemap_entry_t *pme, struct pagemapread *pm,
 
 	if (page && !PageAnon(page))
 		flags |= PM_FILE;
+	if (page && page_mapcount(page) == 1)
+		flags2 |= __PM_MMAP_EXCLUSIVE;
 	if ((vma->vm_flags & VM_SOFTDIRTY))
 		flags2 |= __PM_SOFT_DIRTY;
 
@@ -1119,6 +1122,13 @@ static int pagemap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 		else
 			pmd_flags2 = 0;
 
+		if (pmd_present(*pmd)) {
+			struct page *page = pmd_page(*pmd);
+
+			if (page_mapcount(page) == 1)
+				pmd_flags2 |= __PM_MMAP_EXCLUSIVE;
+		}
+
 		for (; addr != end; addr += PAGE_SIZE) {
 			unsigned long offset;
 			pagemap_entry_t pme;
diff --git a/tools/vm/page-types.c b/tools/vm/page-types.c
index 8bdf16b8ba60..3a9f193526ee 100644
--- a/tools/vm/page-types.c
+++ b/tools/vm/page-types.c
@@ -70,9 +70,12 @@
 #define PM_PFRAME(x)        ((x) & PM_PFRAME_MASK)
 
 #define __PM_SOFT_DIRTY      (1LL)
+#define __PM_MMAP_EXCLUSIVE  (2LL)
 #define PM_PRESENT          PM_STATUS(4LL)
 #define PM_SWAP             PM_STATUS(2LL)
+#define PM_FILE             PM_STATUS(1LL)
 #define PM_SOFT_DIRTY       __PM_PSHIFT(__PM_SOFT_DIRTY)
+#define PM_MMAP_EXCLUSIVE   __PM_PSHIFT(__PM_MMAP_EXCLUSIVE)
 
 
 /*
@@ -100,6 +103,8 @@
 #define KPF_SLOB_FREE		49
 #define KPF_SLUB_FROZEN		50
 #define KPF_SLUB_DEBUG		51
+#define KPF_FILE		62
+#define KPF_MMAP_EXCLUSIVE	63
 
 #define KPF_ALL_BITS		((uint64_t)~0ULL)
 #define KPF_HACKERS_BITS	(0xffffULL << 32)
@@ -149,6 +154,9 @@ static const char * const page_flag_names[] = {
 	[KPF_SLOB_FREE]		= "P:slob_free",
 	[KPF_SLUB_FROZEN]	= "A:slub_frozen",
 	[KPF_SLUB_DEBUG]	= "E:slub_debug",
+
+	[KPF_FILE]		= "F:file",
+	[KPF_MMAP_EXCLUSIVE]	= "1:mmap_exclusive",
 };
 
 
@@ -452,6 +460,10 @@ static uint64_t expand_overloaded_flags(uint64_t flags, uint64_t pme)
 
 	if (pme & PM_SOFT_DIRTY)
 		flags |= BIT(SOFTDIRTY);
+	if (pme & PM_FILE)
+		flags |= BIT(FILE);
+	if (pme & PM_MMAP_EXCLUSIVE)
+		flags |= BIT(MMAP_EXCLUSIVE);
 
 	return flags;
 }


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH v2 2/3] pagemap: hide physical addresses from non-privileged users
  2015-05-12  9:43 [PATCH RFC 0/3] pagemap: make useable for non-privilege users Konstantin Khlebnikov
  2015-05-12  9:43 ` [PATCH v2 1/3] pagemap: add mmap-exclusive bit for marking pages mapped only here Konstantin Khlebnikov
@ 2015-05-12  9:43 ` Konstantin Khlebnikov
  2015-05-12 11:22   ` Mark Williamson
  2015-05-12 15:06   ` Linus Torvalds
  2015-05-12  9:43 ` [PATCH v2 3/3] pagemap: switch to the new format and do some cleanup Konstantin Khlebnikov
  2015-05-12 11:13 ` [PATCH RFC 0/3] pagemap: make useable for non-privilege users Mark Williamson
  3 siblings, 2 replies; 18+ messages in thread
From: Konstantin Khlebnikov @ 2015-05-12  9:43 UTC (permalink / raw
  To: linux-mm, Naoya Horiguchi, linux-kernel, Andrew Morton
  Cc: Mark Williamson, Pavel Emelyanov, linux-api, Andy Lutomirski,
	Vlastimil Babka, Pavel Machek, Mark Seaborn, Kirill A. Shutemov,
	Linus Torvalds, Daniel James, Finn Grimwood

This patch makes pagemap readable for normal users back but hides physical
addresses from them. For some use cases PFN isn't required at all: flags
give information about presence, page type (anon/file/swap), soft-dirty mark,
and hint about page mapcount state: exclusive(mapcount = 1) or (mapcount > 1).

Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Fixes: ab676b7d6fbf ("pagemap: do not leak physical addresses to non-privileged userspace")
Link: lkml.kernel.org/r/1425935472-17949-1-git-send-email-kirill@shutemov.name
---
 fs/proc/task_mmu.c |   36 ++++++++++++++++++++++--------------
 1 file changed, 22 insertions(+), 14 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 29febec65de4..0b7a8ffec95f 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -962,6 +962,7 @@ struct pagemapread {
 	int pos, len;		/* units: PM_ENTRY_BYTES, not bytes */
 	pagemap_entry_t *buffer;
 	bool v2;
+	bool show_pfn;
 };
 
 #define PAGEMAP_WALK_SIZE	(PMD_SIZE)
@@ -1046,12 +1047,13 @@ out:
 static void pte_to_pagemap_entry(pagemap_entry_t *pme, struct pagemapread *pm,
 		struct vm_area_struct *vma, unsigned long addr, pte_t pte)
 {
-	u64 frame, flags;
+	u64 frame = 0, flags;
 	struct page *page = NULL;
 	int flags2 = 0;
 
 	if (pte_present(pte)) {
-		frame = pte_pfn(pte);
+		if (pm->show_pfn)
+			frame = pte_pfn(pte);
 		flags = PM_PRESENT;
 		page = vm_normal_page(vma, addr, pte);
 		if (pte_soft_dirty(pte))
@@ -1087,15 +1089,19 @@ static void pte_to_pagemap_entry(pagemap_entry_t *pme, struct pagemapread *pm,
 static void thp_pmd_to_pagemap_entry(pagemap_entry_t *pme, struct pagemapread *pm,
 		pmd_t pmd, int offset, int pmd_flags2)
 {
+	u64 frame = 0;
+
 	/*
 	 * Currently pmd for thp is always present because thp can not be
 	 * swapped-out, migrated, or HWPOISONed (split in such cases instead.)
 	 * This if-check is just to prepare for future implementation.
 	 */
-	if (pmd_present(pmd))
-		*pme = make_pme(PM_PFRAME(pmd_pfn(pmd) + offset)
-				| PM_STATUS2(pm->v2, pmd_flags2) | PM_PRESENT);
-	else
+	if (pmd_present(pmd)) {
+		if (pm->show_pfn)
+			frame = pmd_pfn(pmd) + offset;
+		*pme = make_pme(PM_PFRAME(frame) | PM_PRESENT |
+				PM_STATUS2(pm->v2, pmd_flags2));
+	} else
 		*pme = make_pme(PM_NOT_PRESENT(pm->v2) | PM_STATUS2(pm->v2, pmd_flags2));
 }
 #else
@@ -1171,11 +1177,14 @@ static int pagemap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 static void huge_pte_to_pagemap_entry(pagemap_entry_t *pme, struct pagemapread *pm,
 					pte_t pte, int offset, int flags2)
 {
-	if (pte_present(pte))
-		*pme = make_pme(PM_PFRAME(pte_pfn(pte) + offset)	|
-				PM_STATUS2(pm->v2, flags2)		|
-				PM_PRESENT);
-	else
+	u64 frame = 0;
+
+	if (pte_present(pte)) {
+		if (pm->show_pfn)
+			frame = pte_pfn(pte) + offset;
+		*pme = make_pme(PM_PFRAME(frame) | PM_PRESENT |
+				PM_STATUS2(pm->v2, flags2));
+	} else
 		*pme = make_pme(PM_NOT_PRESENT(pm->v2)			|
 				PM_STATUS2(pm->v2, flags2));
 }
@@ -1260,6 +1269,8 @@ static ssize_t pagemap_read(struct file *file, char __user *buf,
 	if (!count)
 		goto out_task;
 
+	/* do not disclose physical addresses: attack vector */
+	pm.show_pfn = capable(CAP_SYS_ADMIN);
 	pm.v2 = soft_dirty_cleared;
 	pm.len = (PAGEMAP_WALK_SIZE >> PAGE_SHIFT);
 	pm.buffer = kmalloc(pm.len * PM_ENTRY_BYTES, GFP_TEMPORARY);
@@ -1335,9 +1346,6 @@ out:
 
 static int pagemap_open(struct inode *inode, struct file *file)
 {
-	/* do not disclose physical addresses: attack vector */
-	if (!capable(CAP_SYS_ADMIN))
-		return -EPERM;
 	pr_warn_once("Bits 55-60 of /proc/PID/pagemap entries are about "
 			"to stop being page-shift some time soon. See the "
 			"linux/Documentation/vm/pagemap.txt for details.\n");


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH v2 3/3] pagemap: switch to the new format and do some cleanup
  2015-05-12  9:43 [PATCH RFC 0/3] pagemap: make useable for non-privilege users Konstantin Khlebnikov
  2015-05-12  9:43 ` [PATCH v2 1/3] pagemap: add mmap-exclusive bit for marking pages mapped only here Konstantin Khlebnikov
  2015-05-12  9:43 ` [PATCH v2 2/3] pagemap: hide physical addresses from non-privileged users Konstantin Khlebnikov
@ 2015-05-12  9:43 ` Konstantin Khlebnikov
  2015-05-12 10:54   ` Kirill A. Shutemov
  2015-05-12 11:13 ` [PATCH RFC 0/3] pagemap: make useable for non-privilege users Mark Williamson
  3 siblings, 1 reply; 18+ messages in thread
From: Konstantin Khlebnikov @ 2015-05-12  9:43 UTC (permalink / raw
  To: linux-mm, Naoya Horiguchi, linux-kernel, Andrew Morton
  Cc: Mark Williamson, Pavel Emelyanov, linux-api, Andy Lutomirski,
	Vlastimil Babka, Pavel Machek, Mark Seaborn, Kirill A. Shutemov,
	Linus Torvalds, Daniel James, Finn Grimwood

This patch removes page-shift bits (scheduled to remove since 3.11) and
completes migration to the new bit layout. Also it cleans messy macro.

Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
---
 fs/proc/task_mmu.c    |  152 ++++++++++++++++---------------------------------
 tools/vm/page-types.c |   29 +++------
 2 files changed, 58 insertions(+), 123 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 0b7a8ffec95f..66bc7207ce90 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -710,23 +710,6 @@ const struct file_operations proc_tid_smaps_operations = {
 	.release	= proc_map_release,
 };
 
-/*
- * We do not want to have constant page-shift bits sitting in
- * pagemap entries and are about to reuse them some time soon.
- *
- * Here's the "migration strategy":
- * 1. when the system boots these bits remain what they are,
- *    but a warning about future change is printed in log;
- * 2. once anyone clears soft-dirty bits via clear_refs file,
- *    these flag is set to denote, that user is aware of the
- *    new API and those page-shift bits change their meaning.
- *    The respective warning is printed in dmesg;
- * 3. In a couple of releases we will remove all the mentions
- *    of page-shift in pagemap entries.
- */
-
-static bool soft_dirty_cleared __read_mostly;
-
 enum clear_refs_types {
 	CLEAR_REFS_ALL = 1,
 	CLEAR_REFS_ANON,
@@ -887,13 +870,6 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
 	if (type < CLEAR_REFS_ALL || type >= CLEAR_REFS_LAST)
 		return -EINVAL;
 
-	if (type == CLEAR_REFS_SOFT_DIRTY) {
-		soft_dirty_cleared = true;
-		pr_warn_once("The pagemap bits 55-60 has changed their meaning!"
-			     " See the linux/Documentation/vm/pagemap.txt for "
-			     "details.\n");
-	}
-
 	task = get_proc_task(file_inode(file));
 	if (!task)
 		return -ESRCH;
@@ -961,38 +937,26 @@ typedef struct {
 struct pagemapread {
 	int pos, len;		/* units: PM_ENTRY_BYTES, not bytes */
 	pagemap_entry_t *buffer;
-	bool v2;
 	bool show_pfn;
 };
 
 #define PAGEMAP_WALK_SIZE	(PMD_SIZE)
 #define PAGEMAP_WALK_MASK	(PMD_MASK)
 
-#define PM_ENTRY_BYTES      sizeof(pagemap_entry_t)
-#define PM_STATUS_BITS      3
-#define PM_STATUS_OFFSET    (64 - PM_STATUS_BITS)
-#define PM_STATUS_MASK      (((1LL << PM_STATUS_BITS) - 1) << PM_STATUS_OFFSET)
-#define PM_STATUS(nr)       (((nr) << PM_STATUS_OFFSET) & PM_STATUS_MASK)
-#define PM_PSHIFT_BITS      6
-#define PM_PSHIFT_OFFSET    (PM_STATUS_OFFSET - PM_PSHIFT_BITS)
-#define PM_PSHIFT_MASK      (((1LL << PM_PSHIFT_BITS) - 1) << PM_PSHIFT_OFFSET)
-#define __PM_PSHIFT(x)      (((u64) (x) << PM_PSHIFT_OFFSET) & PM_PSHIFT_MASK)
-#define PM_PFRAME_MASK      ((1LL << PM_PSHIFT_OFFSET) - 1)
-#define PM_PFRAME(x)        ((x) & PM_PFRAME_MASK)
-/* in "new" pagemap pshift bits are occupied with more status bits */
-#define PM_STATUS2(v2, x)   (__PM_PSHIFT(v2 ? x : PAGE_SHIFT))
-
-#define __PM_SOFT_DIRTY      (1LL)
-#define __PM_MMAP_EXCLUSIVE  (2LL)
-#define PM_PRESENT          PM_STATUS(4LL)
-#define PM_SWAP             PM_STATUS(2LL)
-#define PM_FILE             PM_STATUS(1LL)
-#define PM_NOT_PRESENT(v2)  PM_STATUS2(v2, 0)
+#define PM_ENTRY_BYTES		sizeof(pagemap_entry_t)
+#define PM_PFEAME_BITS		54
+#define PM_PFRAME_MASK		GENMASK_ULL(PM_PFEAME_BITS - 1, 0)
+#define PM_SOFT_DIRTY		BIT_ULL(55)
+#define PM_MMAP_EXCLUSIVE	BIT_ULL(56)
+#define PM_FILE			BIT_ULL(61)
+#define PM_SWAP			BIT_ULL(62)
+#define PM_PRESENT		BIT_ULL(63)
+
 #define PM_END_OF_BUFFER    1
 
-static inline pagemap_entry_t make_pme(u64 val)
+static inline pagemap_entry_t make_pme(u64 frame, u64 flags)
 {
-	return (pagemap_entry_t) { .pme = val };
+	return (pagemap_entry_t) { .pme = (frame & PM_PFRAME_MASK) | flags };
 }
 
 static int add_to_pagemap(unsigned long addr, pagemap_entry_t *pme,
@@ -1013,7 +977,7 @@ static int pagemap_pte_hole(unsigned long start, unsigned long end,
 
 	while (addr < end) {
 		struct vm_area_struct *vma = find_vma(walk->mm, addr);
-		pagemap_entry_t pme = make_pme(PM_NOT_PRESENT(pm->v2));
+		pagemap_entry_t pme = make_pme(0, 0);
 		/* End of address space hole, which we mark as non-present. */
 		unsigned long hole_end;
 
@@ -1033,7 +997,7 @@ static int pagemap_pte_hole(unsigned long start, unsigned long end,
 
 		/* Addresses in the VMA. */
 		if (vma->vm_flags & VM_SOFTDIRTY)
-			pme.pme |= PM_STATUS2(pm->v2, __PM_SOFT_DIRTY);
+			pme = make_pme(0, PM_SOFT_DIRTY);
 		for (; addr < min(end, vma->vm_end); addr += PAGE_SIZE) {
 			err = add_to_pagemap(addr, &pme, pm);
 			if (err)
@@ -1044,50 +1008,44 @@ out:
 	return err;
 }
 
-static void pte_to_pagemap_entry(pagemap_entry_t *pme, struct pagemapread *pm,
+static pagemap_entry_t pte_to_pagemap_entry(struct pagemapread *pm,
 		struct vm_area_struct *vma, unsigned long addr, pte_t pte)
 {
-	u64 frame = 0, flags;
+	u64 frame = 0, flags = 0;
 	struct page *page = NULL;
-	int flags2 = 0;
 
 	if (pte_present(pte)) {
 		if (pm->show_pfn)
 			frame = pte_pfn(pte);
-		flags = PM_PRESENT;
+		flags |= PM_PRESENT;
 		page = vm_normal_page(vma, addr, pte);
 		if (pte_soft_dirty(pte))
-			flags2 |= __PM_SOFT_DIRTY;
+			flags |= PM_SOFT_DIRTY;
 	} else if (is_swap_pte(pte)) {
 		swp_entry_t entry;
 		if (pte_swp_soft_dirty(pte))
-			flags2 |= __PM_SOFT_DIRTY;
+			flags |= PM_SOFT_DIRTY;
 		entry = pte_to_swp_entry(pte);
 		frame = swp_type(entry) |
 			(swp_offset(entry) << MAX_SWAPFILES_SHIFT);
-		flags = PM_SWAP;
+		flags |= PM_SWAP;
 		if (is_migration_entry(entry))
 			page = migration_entry_to_page(entry);
-	} else {
-		if (vma->vm_flags & VM_SOFTDIRTY)
-			flags2 |= __PM_SOFT_DIRTY;
-		*pme = make_pme(PM_NOT_PRESENT(pm->v2) | PM_STATUS2(pm->v2, flags2));
-		return;
 	}
 
 	if (page && !PageAnon(page))
 		flags |= PM_FILE;
 	if (page && page_mapcount(page) == 1)
-		flags2 |= __PM_MMAP_EXCLUSIVE;
-	if ((vma->vm_flags & VM_SOFTDIRTY))
-		flags2 |= __PM_SOFT_DIRTY;
+		flags |= PM_MMAP_EXCLUSIVE;
+	if (vma->vm_flags & VM_SOFTDIRTY)
+		flags |= PM_SOFT_DIRTY;
 
-	*pme = make_pme(PM_PFRAME(frame) | PM_STATUS2(pm->v2, flags2) | flags);
+	return make_pme(frame, flags);
 }
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
-static void thp_pmd_to_pagemap_entry(pagemap_entry_t *pme, struct pagemapread *pm,
-		pmd_t pmd, int offset, int pmd_flags2)
+static pagemap_entry_t thp_pmd_to_pagemap_entry(struct pagemapread *pm,
+		pmd_t pmd, int offset, u64 flags)
 {
 	u64 frame = 0;
 
@@ -1099,15 +1057,16 @@ static void thp_pmd_to_pagemap_entry(pagemap_entry_t *pme, struct pagemapread *p
 	if (pmd_present(pmd)) {
 		if (pm->show_pfn)
 			frame = pmd_pfn(pmd) + offset;
-		*pme = make_pme(PM_PFRAME(frame) | PM_PRESENT |
-				PM_STATUS2(pm->v2, pmd_flags2));
-	} else
-		*pme = make_pme(PM_NOT_PRESENT(pm->v2) | PM_STATUS2(pm->v2, pmd_flags2));
+		flags |= PM_PRESENT;
+	}
+
+	return make_pme(frame, flags);
 }
 #else
-static inline void thp_pmd_to_pagemap_entry(pagemap_entry_t *pme, struct pagemapread *pm,
-		pmd_t pmd, int offset, int pmd_flags2)
+static pagemap_entry_t thp_pmd_to_pagemap_entry(struct pagemapread *pm,
+		pmd_t pmd, int offset, u64 flags)
 {
+	return make_pme(0, 0);
 }
 #endif
 
@@ -1121,18 +1080,16 @@ static int pagemap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 	int err = 0;
 
 	if (pmd_trans_huge_lock(pmd, vma, &ptl) == 1) {
-		int pmd_flags2;
+		u64 flags = 0;
 
 		if ((vma->vm_flags & VM_SOFTDIRTY) || pmd_soft_dirty(*pmd))
-			pmd_flags2 = __PM_SOFT_DIRTY;
-		else
-			pmd_flags2 = 0;
+			flags |= PM_SOFT_DIRTY;
 
 		if (pmd_present(*pmd)) {
 			struct page *page = pmd_page(*pmd);
 
 			if (page_mapcount(page) == 1)
-				pmd_flags2 |= __PM_MMAP_EXCLUSIVE;
+				flags |= PM_MMAP_EXCLUSIVE;
 		}
 
 		for (; addr != end; addr += PAGE_SIZE) {
@@ -1141,7 +1098,7 @@ static int pagemap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 
 			offset = (addr & ~PAGEMAP_WALK_MASK) >>
 					PAGE_SHIFT;
-			thp_pmd_to_pagemap_entry(&pme, pm, *pmd, offset, pmd_flags2);
+			pme = thp_pmd_to_pagemap_entry(pm, *pmd, offset, flags);
 			err = add_to_pagemap(addr, &pme, pm);
 			if (err)
 				break;
@@ -1161,7 +1118,7 @@ static int pagemap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 	for (; addr < end; pte++, addr += PAGE_SIZE) {
 		pagemap_entry_t pme;
 
-		pte_to_pagemap_entry(&pme, pm, vma, addr, *pte);
+		pme = pte_to_pagemap_entry(pm, vma, addr, *pte);
 		err = add_to_pagemap(addr, &pme, pm);
 		if (err)
 			break;
@@ -1174,19 +1131,18 @@ static int pagemap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 }
 
 #ifdef CONFIG_HUGETLB_PAGE
-static void huge_pte_to_pagemap_entry(pagemap_entry_t *pme, struct pagemapread *pm,
-					pte_t pte, int offset, int flags2)
+static pagemap_entry_t huge_pte_to_pagemap_entry(struct pagemapread *pm,
+					pte_t pte, int offset, u64 flags)
 {
 	u64 frame = 0;
 
 	if (pte_present(pte)) {
 		if (pm->show_pfn)
 			frame = pte_pfn(pte) + offset;
-		*pme = make_pme(PM_PFRAME(frame) | PM_PRESENT |
-				PM_STATUS2(pm->v2, flags2));
-	} else
-		*pme = make_pme(PM_NOT_PRESENT(pm->v2)			|
-				PM_STATUS2(pm->v2, flags2));
+		flags |= PM_PRESENT;
+	}
+
+	return make_pme(frame, flags);
 }
 
 /* This function walks within one hugetlb entry in the single call */
@@ -1197,17 +1153,15 @@ static int pagemap_hugetlb_range(pte_t *pte, unsigned long hmask,
 	struct pagemapread *pm = walk->private;
 	struct vm_area_struct *vma = walk->vma;
 	int err = 0;
-	int flags2;
+	u64 flags = 0;
 	pagemap_entry_t pme;
 
 	if (vma->vm_flags & VM_SOFTDIRTY)
-		flags2 = __PM_SOFT_DIRTY;
-	else
-		flags2 = 0;
+		flags |= PM_SOFT_DIRTY;
 
 	for (; addr != end; addr += PAGE_SIZE) {
 		int offset = (addr & ~hmask) >> PAGE_SHIFT;
-		huge_pte_to_pagemap_entry(&pme, pm, *pte, offset, flags2);
+		pme = huge_pte_to_pagemap_entry(pm, *pte, offset, flags);
 		err = add_to_pagemap(addr, &pme, pm);
 		if (err)
 			return err;
@@ -1228,7 +1182,9 @@ static int pagemap_hugetlb_range(pte_t *pte, unsigned long hmask,
  * Bits 0-54  page frame number (PFN) if present
  * Bits 0-4   swap type if swapped
  * Bits 5-54  swap offset if swapped
- * Bits 55-60 page shift (page size = 1<<page shift)
+ * Bit  55    pte is soft-dirty (see Documentation/vm/soft-dirty.txt)
+ * Bit  56    page exclusively mapped
+ * Bits 57-60 zero
  * Bit  61    page is file-page or shared-anon
  * Bit  62    page swapped
  * Bit  63    page present
@@ -1271,7 +1227,6 @@ static ssize_t pagemap_read(struct file *file, char __user *buf,
 
 	/* do not disclose physical addresses: attack vector */
 	pm.show_pfn = capable(CAP_SYS_ADMIN);
-	pm.v2 = soft_dirty_cleared;
 	pm.len = (PAGEMAP_WALK_SIZE >> PAGE_SHIFT);
 	pm.buffer = kmalloc(pm.len * PM_ENTRY_BYTES, GFP_TEMPORARY);
 	ret = -ENOMEM;
@@ -1344,18 +1299,9 @@ out:
 	return ret;
 }
 
-static int pagemap_open(struct inode *inode, struct file *file)
-{
-	pr_warn_once("Bits 55-60 of /proc/PID/pagemap entries are about "
-			"to stop being page-shift some time soon. See the "
-			"linux/Documentation/vm/pagemap.txt for details.\n");
-	return 0;
-}
-
 const struct file_operations proc_pagemap_operations = {
 	.llseek		= mem_lseek, /* borrow this */
 	.read		= pagemap_read,
-	.open		= pagemap_open,
 };
 #endif /* CONFIG_PROC_PAGE_MONITOR */
 
diff --git a/tools/vm/page-types.c b/tools/vm/page-types.c
index 3a9f193526ee..1fa872e58238 100644
--- a/tools/vm/page-types.c
+++ b/tools/vm/page-types.c
@@ -57,26 +57,15 @@
  * pagemap kernel ABI bits
  */
 
-#define PM_ENTRY_BYTES      sizeof(uint64_t)
-#define PM_STATUS_BITS      3
-#define PM_STATUS_OFFSET    (64 - PM_STATUS_BITS)
-#define PM_STATUS_MASK      (((1LL << PM_STATUS_BITS) - 1) << PM_STATUS_OFFSET)
-#define PM_STATUS(nr)       (((nr) << PM_STATUS_OFFSET) & PM_STATUS_MASK)
-#define PM_PSHIFT_BITS      6
-#define PM_PSHIFT_OFFSET    (PM_STATUS_OFFSET - PM_PSHIFT_BITS)
-#define PM_PSHIFT_MASK      (((1LL << PM_PSHIFT_BITS) - 1) << PM_PSHIFT_OFFSET)
-#define __PM_PSHIFT(x)      (((uint64_t) (x) << PM_PSHIFT_OFFSET) & PM_PSHIFT_MASK)
-#define PM_PFRAME_MASK      ((1LL << PM_PSHIFT_OFFSET) - 1)
-#define PM_PFRAME(x)        ((x) & PM_PFRAME_MASK)
-
-#define __PM_SOFT_DIRTY      (1LL)
-#define __PM_MMAP_EXCLUSIVE  (2LL)
-#define PM_PRESENT          PM_STATUS(4LL)
-#define PM_SWAP             PM_STATUS(2LL)
-#define PM_FILE             PM_STATUS(1LL)
-#define PM_SOFT_DIRTY       __PM_PSHIFT(__PM_SOFT_DIRTY)
-#define PM_MMAP_EXCLUSIVE   __PM_PSHIFT(__PM_MMAP_EXCLUSIVE)
-
+#define PM_ENTRY_BYTES		8
+#define PM_PFEAME_BITS		54
+#define PM_PFRAME_MASK		((1LL << PM_PFEAME_BITS) - 1)
+#define PM_PFRAME(x)		((x) & PM_PFRAME_MASK)
+#define PM_SOFT_DIRTY		(1ULL << 55)
+#define PM_MMAP_EXCLUSIVE	(1ULL << 56)
+#define PM_FILE			(1ULL << 61)
+#define PM_SWAP			(1ULL << 62)
+#define PM_PRESENT		(1ULL << 63)
 
 /*
  * kernel page flags


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 1/3] pagemap: add mmap-exclusive bit for marking pages mapped only here
  2015-05-12  9:43 ` [PATCH v2 1/3] pagemap: add mmap-exclusive bit for marking pages mapped only here Konstantin Khlebnikov
@ 2015-05-12 10:40   ` Kirill A. Shutemov
  2015-05-13 10:59     ` Konstantin Khlebnikov
  2015-05-12 12:05   ` Mark Williamson
  1 sibling, 1 reply; 18+ messages in thread
From: Kirill A. Shutemov @ 2015-05-12 10:40 UTC (permalink / raw
  To: Konstantin Khlebnikov
  Cc: linux-mm, Naoya Horiguchi, linux-kernel, Andrew Morton,
	Mark Williamson, Pavel Emelyanov, linux-api, Andy Lutomirski,
	Vlastimil Babka, Pavel Machek, Mark Seaborn, Linus Torvalds,
	Daniel James, Finn Grimwood

On Tue, May 12, 2015 at 12:43:03PM +0300, Konstantin Khlebnikov wrote:
> This patch sets bit 56 in pagemap if this page is mapped only once.
> It allows to detect exclusively used pages without exposing PFN:
> 
> present file exclusive state
> 0       0    0         non-present
> 1       1    0         file page mapped somewhere else
> 1       1    1         file page mapped only here
> 1       0    0         anon non-CoWed page (shared with parent/child)
> 1       0    1         anon CoWed page (or never forked)

Probably, worth noting that file-private pages are anon in this context.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 3/3] pagemap: switch to the new format and do some cleanup
  2015-05-12  9:43 ` [PATCH v2 3/3] pagemap: switch to the new format and do some cleanup Konstantin Khlebnikov
@ 2015-05-12 10:54   ` Kirill A. Shutemov
  2015-05-13 11:39     ` Konstantin Khlebnikov
  0 siblings, 1 reply; 18+ messages in thread
From: Kirill A. Shutemov @ 2015-05-12 10:54 UTC (permalink / raw
  To: Konstantin Khlebnikov
  Cc: linux-mm, Naoya Horiguchi, linux-kernel, Andrew Morton,
	Mark Williamson, Pavel Emelyanov, linux-api, Andy Lutomirski,
	Vlastimil Babka, Pavel Machek, Mark Seaborn, Linus Torvalds,
	Daniel James, Finn Grimwood

On Tue, May 12, 2015 at 12:43:06PM +0300, Konstantin Khlebnikov wrote:
> This patch removes page-shift bits (scheduled to remove since 3.11) and
> completes migration to the new bit layout. Also it cleans messy macro.
> 
> Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
> ---
>  fs/proc/task_mmu.c    |  152 ++++++++++++++++---------------------------------
>  tools/vm/page-types.c |   29 +++------
>  2 files changed, 58 insertions(+), 123 deletions(-)
> 
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index 0b7a8ffec95f..66bc7207ce90 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -710,23 +710,6 @@ const struct file_operations proc_tid_smaps_operations = {
>  	.release	= proc_map_release,
>  };
>  
> -/*
> - * We do not want to have constant page-shift bits sitting in
> - * pagemap entries and are about to reuse them some time soon.
> - *
> - * Here's the "migration strategy":
> - * 1. when the system boots these bits remain what they are,
> - *    but a warning about future change is printed in log;
> - * 2. once anyone clears soft-dirty bits via clear_refs file,
> - *    these flag is set to denote, that user is aware of the
> - *    new API and those page-shift bits change their meaning.
> - *    The respective warning is printed in dmesg;
> - * 3. In a couple of releases we will remove all the mentions
> - *    of page-shift in pagemap entries.
> - */

Wouldn't it be better to just have v2=1 by default for couple releases to
see if anything breaks? This way we can revert easily if regression reported.
I guess someone could miss this change coming if he didn't touch clear_refs.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH RFC 0/3] pagemap: make useable for non-privilege users
  2015-05-12  9:43 [PATCH RFC 0/3] pagemap: make useable for non-privilege users Konstantin Khlebnikov
                   ` (2 preceding siblings ...)
  2015-05-12  9:43 ` [PATCH v2 3/3] pagemap: switch to the new format and do some cleanup Konstantin Khlebnikov
@ 2015-05-12 11:13 ` Mark Williamson
  2015-05-14 18:40   ` Mark Williamson
  3 siblings, 1 reply; 18+ messages in thread
From: Mark Williamson @ 2015-05-12 11:13 UTC (permalink / raw
  To: Konstantin Khlebnikov
  Cc: linux-mm, Naoya Horiguchi, kernel list, Andrew Morton,
	Pavel Emelyanov, Linux API, Andy Lutomirski, Vlastimil Babka,
	Pavel Machek, Mark Seaborn, Kirill A. Shutemov, Linus Torvalds,
	Daniel James, Finn Grimwood

Hi Konstantin,

Thanks very much for continuing to look at this!  It's very much
appreciated.  I've been investigating from our end but got caught up
in some gnarly details of our pagemap-consuming code.

I like the approach and it seems like the information you're exposing
will be useful for our application.  I'll test the patch and see if it
works for us as-is.

Will follow up with any comments on the individual patches.

Thanks,
Mark

On Tue, May 12, 2015 at 10:43 AM, Konstantin Khlebnikov
<khlebnikov@yandex-team.ru> wrote:
> This patchset tries to make pagemap useable again in the safe way.
> First patch adds bit 'map-exlusive' which is set if page is mapped only here.
> Second patch restores access for non-privileged users but hides pfn if task
> has no capability CAP_SYS_ADMIN. Third patch removes page-shift bits and
> completes migration to the new pagemap format (flags soft-dirty and
> mmap-exlusive are available only in the new format).
>
> ---
>
> Konstantin Khlebnikov (3):
>       pagemap: add mmap-exclusive bit for marking pages mapped only here
>       pagemap: hide physical addresses from non-privileged users
>       pagemap: switch to the new format and do some cleanup
>
>
>  Documentation/vm/pagemap.txt |    3 -
>  fs/proc/task_mmu.c           |  178 +++++++++++++++++-------------------------
>  tools/vm/page-types.c        |   35 ++++----
>  3 files changed, 91 insertions(+), 125 deletions(-)

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 2/3] pagemap: hide physical addresses from non-privileged users
  2015-05-12  9:43 ` [PATCH v2 2/3] pagemap: hide physical addresses from non-privileged users Konstantin Khlebnikov
@ 2015-05-12 11:22   ` Mark Williamson
  2015-05-12 15:06   ` Linus Torvalds
  1 sibling, 0 replies; 18+ messages in thread
From: Mark Williamson @ 2015-05-12 11:22 UTC (permalink / raw
  To: Konstantin Khlebnikov
  Cc: linux-mm, Naoya Horiguchi, kernel list, Andrew Morton,
	Pavel Emelyanov, Linux API, Andy Lutomirski, Vlastimil Babka,
	Pavel Machek, Mark Seaborn, Kirill A. Shutemov, Linus Torvalds,
	Daniel James, Finn Grimwood

Hi Konstantin,

Comments inline...

On Tue, May 12, 2015 at 10:43 AM, Konstantin Khlebnikov
<khlebnikov@yandex-team.ru> wrote:
> This patch makes pagemap readable for normal users back but hides physical
> addresses from them. For some use cases PFN isn't required at all: flags
> give information about presence, page type (anon/file/swap), soft-dirty mark,
> and hint about page mapcount state: exclusive(mapcount = 1) or (mapcount > 1).
>
> Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
> Fixes: ab676b7d6fbf ("pagemap: do not leak physical addresses to non-privileged userspace")
> Link: lkml.kernel.org/r/1425935472-17949-1-git-send-email-kirill@shutemov.name
> ---
>  fs/proc/task_mmu.c |   36 ++++++++++++++++++++++--------------
>  1 file changed, 22 insertions(+), 14 deletions(-)
>
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index 29febec65de4..0b7a8ffec95f 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -962,6 +962,7 @@ struct pagemapread {
>         int pos, len;           /* units: PM_ENTRY_BYTES, not bytes */
>         pagemap_entry_t *buffer;
>         bool v2;
> +       bool show_pfn;
>  };
>
>  #define PAGEMAP_WALK_SIZE      (PMD_SIZE)
> @@ -1046,12 +1047,13 @@ out:
>  static void pte_to_pagemap_entry(pagemap_entry_t *pme, struct pagemapread *pm,
>                 struct vm_area_struct *vma, unsigned long addr, pte_t pte)
>  {
> -       u64 frame, flags;
> +       u64 frame = 0, flags;
>         struct page *page = NULL;
>         int flags2 = 0;
>
>         if (pte_present(pte)) {
> -               frame = pte_pfn(pte);
> +               if (pm->show_pfn)
> +                       frame = pte_pfn(pte);
>                 flags = PM_PRESENT;
>                 page = vm_normal_page(vma, addr, pte);
>                 if (pte_soft_dirty(pte))
> @@ -1087,15 +1089,19 @@ static void pte_to_pagemap_entry(pagemap_entry_t *pme, struct pagemapread *pm,
>  static void thp_pmd_to_pagemap_entry(pagemap_entry_t *pme, struct pagemapread *pm,
>                 pmd_t pmd, int offset, int pmd_flags2)
>  {
> +       u64 frame = 0;
> +
>         /*
>          * Currently pmd for thp is always present because thp can not be
>          * swapped-out, migrated, or HWPOISONed (split in such cases instead.)
>          * This if-check is just to prepare for future implementation.
>          */
> -       if (pmd_present(pmd))
> -               *pme = make_pme(PM_PFRAME(pmd_pfn(pmd) + offset)
> -                               | PM_STATUS2(pm->v2, pmd_flags2) | PM_PRESENT);
> -       else
> +       if (pmd_present(pmd)) {
> +               if (pm->show_pfn)
> +                       frame = pmd_pfn(pmd) + offset;
> +               *pme = make_pme(PM_PFRAME(frame) | PM_PRESENT |
> +                               PM_STATUS2(pm->v2, pmd_flags2));
> +       } else
>                 *pme = make_pme(PM_NOT_PRESENT(pm->v2) | PM_STATUS2(pm->v2, pmd_flags2));
>  }
>  #else
> @@ -1171,11 +1177,14 @@ static int pagemap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
>  static void huge_pte_to_pagemap_entry(pagemap_entry_t *pme, struct pagemapread *pm,
>                                         pte_t pte, int offset, int flags2)
>  {
> -       if (pte_present(pte))
> -               *pme = make_pme(PM_PFRAME(pte_pfn(pte) + offset)        |
> -                               PM_STATUS2(pm->v2, flags2)              |
> -                               PM_PRESENT);
> -       else
> +       u64 frame = 0;
> +
> +       if (pte_present(pte)) {
> +               if (pm->show_pfn)
> +                       frame = pte_pfn(pte) + offset;
> +               *pme = make_pme(PM_PFRAME(frame) | PM_PRESENT |
> +                               PM_STATUS2(pm->v2, flags2));
> +       } else
>                 *pme = make_pme(PM_NOT_PRESENT(pm->v2)                  |
>                                 PM_STATUS2(pm->v2, flags2));
>  }
> @@ -1260,6 +1269,8 @@ static ssize_t pagemap_read(struct file *file, char __user *buf,
>         if (!count)
>                 goto out_task;
>
> +       /* do not disclose physical addresses: attack vector */
> +       pm.show_pfn = capable(CAP_SYS_ADMIN);

If I understood correctly, Linus recommended to me that we use the
open-time capabilities of the file descriptor rather than the current
capability state (to mitigate against an attacker passing an FD to a
setuid process, I think).

FWIW, I knocked up a quick internal patch (less comprehensive than
yours!) and used file_ns_capable() successfully, i.e:
    pm.show_pfn = file_ns_capable(file, &init_user_ns, CAP_SYS_ADMIN);

It looked promising to be but I've not done the checking to verify
that this is strictly correct; the capabilities stuff is not an area
of the kernel I'm familiar with.

>         pm.v2 = soft_dirty_cleared;
>         pm.len = (PAGEMAP_WALK_SIZE >> PAGE_SHIFT);
>         pm.buffer = kmalloc(pm.len * PM_ENTRY_BYTES, GFP_TEMPORARY);
> @@ -1335,9 +1346,6 @@ out:
>
>  static int pagemap_open(struct inode *inode, struct file *file)
>  {
> -       /* do not disclose physical addresses: attack vector */
> -       if (!capable(CAP_SYS_ADMIN))
> -               return -EPERM;
>         pr_warn_once("Bits 55-60 of /proc/PID/pagemap entries are about "
>                         "to stop being page-shift some time soon. See the "
>                         "linux/Documentation/vm/pagemap.txt for details.\n");
>

No other comments on this, looks like it would help us.

Thanks,
Mark

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 1/3] pagemap: add mmap-exclusive bit for marking pages mapped only here
  2015-05-12  9:43 ` [PATCH v2 1/3] pagemap: add mmap-exclusive bit for marking pages mapped only here Konstantin Khlebnikov
  2015-05-12 10:40   ` Kirill A. Shutemov
@ 2015-05-12 12:05   ` Mark Williamson
  2015-05-13 10:51     ` Konstantin Khlebnikov
  1 sibling, 1 reply; 18+ messages in thread
From: Mark Williamson @ 2015-05-12 12:05 UTC (permalink / raw
  To: Konstantin Khlebnikov
  Cc: linux-mm, Naoya Horiguchi, kernel list, Andrew Morton,
	Pavel Emelyanov, Linux API, Andy Lutomirski, Vlastimil Babka,
	Pavel Machek, Mark Seaborn, Kirill A. Shutemov, Linus Torvalds,
	Daniel James, Finn Grimwood

Hi Konstantin,

I hope you won't mind me thinking out loud here on the idea of adding
a flag to the v2 pagemap fields...  From a kernel PoV, I agree that
this seems like the cleanest approach.  However, with my application
developer hat on:

 1. I was hoping we'd be able to backport a compatible fix to older
kernels that might adopt the pagemap permissions change.  Using the V2
format flags rules out doing this for kernels that are too old to have
soft-dirty, I think.
 2. From our software's PoV, I feel it's worth noting that it doesn't
strictly fix ABI compatibility, though I realise that's probably not
your primary concern here.  We'll need to modify our code to write the
clear_refs file but that change is OK for us if it's the preferred
solution.

In the patches I've been playing with, I was considering putting the
Exclusive flag in the now-unused PFN field of the pagemap entries.
Since we're specifically trying to work around for the lack of PFN
information, would there be any appetite for mirroring this flag
unconditionally into the now-empty PFN field (i.e. whether using v1 or
v2 flags) when accessed by an unprivileged process?

I realise it's ugly from a kernel PoV and I feel a little bad for
suggesting it - but it would address points 1 and 2 for us (our
existing code just looks for changes in the pagemap entry, so sticking
the flag in there would cause it to do the right thing).

I'm sorry to raise application-specific issues at this point; I
appreciate that your primary concern is to improve the kernel and
technically I like the approach that you've taken!  I'll try and
provide more code-oriented feedback once I've tried out the changes.

Thanks,
Mark

On Tue, May 12, 2015 at 10:43 AM, Konstantin Khlebnikov
<khlebnikov@yandex-team.ru> wrote:
> This patch sets bit 56 in pagemap if this page is mapped only once.
> It allows to detect exclusively used pages without exposing PFN:
>
> present file exclusive state
> 0       0    0         non-present
> 1       1    0         file page mapped somewhere else
> 1       1    1         file page mapped only here
> 1       0    0         anon non-CoWed page (shared with parent/child)
> 1       0    1         anon CoWed page (or never forked)
>
> Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
> Link: lkml.kernel.org/r/CAEVpBa+_RyACkhODZrRvQLs80iy0sqpdrd0AaP_-tgnX3Y9yNQ@mail.gmail.com
>
> ---
>
> v2:
> * handle transparent huge pages
> * invert bit and rename shared -> exclusive (less confusing name)
> ---
>  Documentation/vm/pagemap.txt |    3 ++-
>  fs/proc/task_mmu.c           |   10 ++++++++++
>  tools/vm/page-types.c        |   12 ++++++++++++
>  3 files changed, 24 insertions(+), 1 deletion(-)
>
> diff --git a/Documentation/vm/pagemap.txt b/Documentation/vm/pagemap.txt
> index 6bfbc172cdb9..3cfbbb333ea1 100644
> --- a/Documentation/vm/pagemap.txt
> +++ b/Documentation/vm/pagemap.txt
> @@ -16,7 +16,8 @@ There are three components to pagemap:
>      * Bits 0-4   swap type if swapped
>      * Bits 5-54  swap offset if swapped
>      * Bit  55    pte is soft-dirty (see Documentation/vm/soft-dirty.txt)
> -    * Bits 56-60 zero
> +    * Bit  56    page exlusively mapped
> +    * Bits 57-60 zero
>      * Bit  61    page is file-page or shared-anon
>      * Bit  62    page swapped
>      * Bit  63    page present
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index 6dee68d013ff..29febec65de4 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -982,6 +982,7 @@ struct pagemapread {
>  #define PM_STATUS2(v2, x)   (__PM_PSHIFT(v2 ? x : PAGE_SHIFT))
>
>  #define __PM_SOFT_DIRTY      (1LL)
> +#define __PM_MMAP_EXCLUSIVE  (2LL)
>  #define PM_PRESENT          PM_STATUS(4LL)
>  #define PM_SWAP             PM_STATUS(2LL)
>  #define PM_FILE             PM_STATUS(1LL)
> @@ -1074,6 +1075,8 @@ static void pte_to_pagemap_entry(pagemap_entry_t *pme, struct pagemapread *pm,
>
>         if (page && !PageAnon(page))
>                 flags |= PM_FILE;
> +       if (page && page_mapcount(page) == 1)
> +               flags2 |= __PM_MMAP_EXCLUSIVE;
>         if ((vma->vm_flags & VM_SOFTDIRTY))
>                 flags2 |= __PM_SOFT_DIRTY;
>
> @@ -1119,6 +1122,13 @@ static int pagemap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
>                 else
>                         pmd_flags2 = 0;
>
> +               if (pmd_present(*pmd)) {
> +                       struct page *page = pmd_page(*pmd);
> +
> +                       if (page_mapcount(page) == 1)
> +                               pmd_flags2 |= __PM_MMAP_EXCLUSIVE;
> +               }
> +
>                 for (; addr != end; addr += PAGE_SIZE) {
>                         unsigned long offset;
>                         pagemap_entry_t pme;
> diff --git a/tools/vm/page-types.c b/tools/vm/page-types.c
> index 8bdf16b8ba60..3a9f193526ee 100644
> --- a/tools/vm/page-types.c
> +++ b/tools/vm/page-types.c
> @@ -70,9 +70,12 @@
>  #define PM_PFRAME(x)        ((x) & PM_PFRAME_MASK)
>
>  #define __PM_SOFT_DIRTY      (1LL)
> +#define __PM_MMAP_EXCLUSIVE  (2LL)
>  #define PM_PRESENT          PM_STATUS(4LL)
>  #define PM_SWAP             PM_STATUS(2LL)
> +#define PM_FILE             PM_STATUS(1LL)
>  #define PM_SOFT_DIRTY       __PM_PSHIFT(__PM_SOFT_DIRTY)
> +#define PM_MMAP_EXCLUSIVE   __PM_PSHIFT(__PM_MMAP_EXCLUSIVE)
>
>
>  /*
> @@ -100,6 +103,8 @@
>  #define KPF_SLOB_FREE          49
>  #define KPF_SLUB_FROZEN                50
>  #define KPF_SLUB_DEBUG         51
> +#define KPF_FILE               62
> +#define KPF_MMAP_EXCLUSIVE     63
>
>  #define KPF_ALL_BITS           ((uint64_t)~0ULL)
>  #define KPF_HACKERS_BITS       (0xffffULL << 32)
> @@ -149,6 +154,9 @@ static const char * const page_flag_names[] = {
>         [KPF_SLOB_FREE]         = "P:slob_free",
>         [KPF_SLUB_FROZEN]       = "A:slub_frozen",
>         [KPF_SLUB_DEBUG]        = "E:slub_debug",
> +
> +       [KPF_FILE]              = "F:file",
> +       [KPF_MMAP_EXCLUSIVE]    = "1:mmap_exclusive",
>  };
>
>
> @@ -452,6 +460,10 @@ static uint64_t expand_overloaded_flags(uint64_t flags, uint64_t pme)
>
>         if (pme & PM_SOFT_DIRTY)
>                 flags |= BIT(SOFTDIRTY);
> +       if (pme & PM_FILE)
> +               flags |= BIT(FILE);
> +       if (pme & PM_MMAP_EXCLUSIVE)
> +               flags |= BIT(MMAP_EXCLUSIVE);
>
>         return flags;
>  }
>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 2/3] pagemap: hide physical addresses from non-privileged users
  2015-05-12  9:43 ` [PATCH v2 2/3] pagemap: hide physical addresses from non-privileged users Konstantin Khlebnikov
  2015-05-12 11:22   ` Mark Williamson
@ 2015-05-12 15:06   ` Linus Torvalds
  2015-05-12 15:41     ` Konstantin Khlebnikov
  1 sibling, 1 reply; 18+ messages in thread
From: Linus Torvalds @ 2015-05-12 15:06 UTC (permalink / raw
  To: Konstantin Khlebnikov
  Cc: linux-mm, Naoya Horiguchi, Linux Kernel Mailing List,
	Andrew Morton, Mark Williamson, Pavel Emelyanov, Linux API,
	Andy Lutomirski, Vlastimil Babka, Pavel Machek, Mark Seaborn,
	Kirill A. Shutemov, Daniel James, Finn Grimwood

On Tue, May 12, 2015 at 2:43 AM, Konstantin Khlebnikov
<khlebnikov@yandex-team.ru> wrote:
> @@ -1260,6 +1269,8 @@ static ssize_t pagemap_read(struct file *file, char __user *buf,
>         if (!count)
>                 goto out_task;
>
> +       /* do not disclose physical addresses: attack vector */
> +       pm.show_pfn = capable(CAP_SYS_ADMIN);
>         pm.v2 = soft_dirty_cleared;
>         pm.len = (PAGEMAP_WALK_SIZE >> PAGE_SHIFT);
>         pm.buffer = kmalloc(pm.len * PM_ENTRY_BYTES, GFP_TEMPORARY);

NO! Dammit, no, no, no!

How many times must people do this major security faux-pas before we learn?

WE DO NOT CHECK CURRENT CAPABILITIES AT READ/WRITE TIME!

It's a bug. It's a security issue. It's not how Unix capabilities work!

Capabilities are checked at open time.:

> @@ -1335,9 +1346,6 @@ out:
>
>  static int pagemap_open(struct inode *inode, struct file *file)
>  {
> -       /* do not disclose physical addresses: attack vector */
> -       if (!capable(CAP_SYS_ADMIN))
> -               return -EPERM;

THIS  is where you are supposed to check for capabilities. The place
where you removed it!

The reason we check capabilities at open time, and open time ONLY is
because that is really very integral to the whole Unix security model.
Otherwise, you get into this situation:

 - unprivileged process opens file

 - unprivileged process tricks suid process to do the actual access for it

where the traditional model is to just force a "write()" by opening
the file as stderr, and then executing a suid process (traditionally
"sudo") that writes an error message to it.

So *don't* do permission checks using read/write time credentials.
They are wrong.

Now, if there is some reason that you really can't do it when opening
the file, and you actually need to use capability information at
read/write time, you use the "file->f_cred" field, which is the
open-time capabilities. So you _can_ do permission checks at
read/write time, but you have to use the credentials of the opener,
not "current".

So in this case, I guess you could use

        pm.show_pfn = file_ns_capable(file, &init_user_ns, CAP_SYS_ADMIN);

if you really need to do this at read time, and cannot fill in that
"show_pfn" at open-time.

                        Linus

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 2/3] pagemap: hide physical addresses from non-privileged users
  2015-05-12 15:06   ` Linus Torvalds
@ 2015-05-12 15:41     ` Konstantin Khlebnikov
  0 siblings, 0 replies; 18+ messages in thread
From: Konstantin Khlebnikov @ 2015-05-12 15:41 UTC (permalink / raw
  To: Linus Torvalds
  Cc: linux-mm, Naoya Horiguchi, Linux Kernel Mailing List,
	Andrew Morton, Mark Williamson, Pavel Emelyanov, Linux API,
	Andy Lutomirski, Vlastimil Babka, Pavel Machek, Mark Seaborn,
	Kirill A. Shutemov, Daniel James, Finn Grimwood

On 12.05.2015 18:06, Linus Torvalds wrote:
> On Tue, May 12, 2015 at 2:43 AM, Konstantin Khlebnikov
> <khlebnikov@yandex-team.ru> wrote:
>> @@ -1260,6 +1269,8 @@ static ssize_t pagemap_read(struct file *file, char __user *buf,
>>          if (!count)
>>                  goto out_task;
>>
>> +       /* do not disclose physical addresses: attack vector */
>> +       pm.show_pfn = capable(CAP_SYS_ADMIN);
>>          pm.v2 = soft_dirty_cleared;
>>          pm.len = (PAGEMAP_WALK_SIZE >> PAGE_SHIFT);
>>          pm.buffer = kmalloc(pm.len * PM_ENTRY_BYTES, GFP_TEMPORARY);
>
> NO! Dammit, no, no, no!
>
> How many times must people do this major security faux-pas before we learn?

Oops. Sorry. I guess everybody must do that mistake at least once.
That's my first time. =)


So, in this case existing call of mm_access() from pagemap_read()
is a bug too because it checks CAP_SYS_PTRACE for current task.

I'll rework it in the same way as /proc/*/[s]maps.

>
> WE DO NOT CHECK CURRENT CAPABILITIES AT READ/WRITE TIME!
>
> It's a bug. It's a security issue. It's not how Unix capabilities work!
>
> Capabilities are checked at open time.:
>
>> @@ -1335,9 +1346,6 @@ out:
>>
>>   static int pagemap_open(struct inode *inode, struct file *file)
>>   {
>> -       /* do not disclose physical addresses: attack vector */
>> -       if (!capable(CAP_SYS_ADMIN))
>> -               return -EPERM;
>
> THIS  is where you are supposed to check for capabilities. The place
> where you removed it!
>
> The reason we check capabilities at open time, and open time ONLY is
> because that is really very integral to the whole Unix security model.
> Otherwise, you get into this situation:
>
>   - unprivileged process opens file
>
>   - unprivileged process tricks suid process to do the actual access for it
>
> where the traditional model is to just force a "write()" by opening
> the file as stderr, and then executing a suid process (traditionally
> "sudo") that writes an error message to it.
>
> So *don't* do permission checks using read/write time credentials.
> They are wrong.
>
> Now, if there is some reason that you really can't do it when opening
> the file, and you actually need to use capability information at
> read/write time, you use the "file->f_cred" field, which is the
> open-time capabilities. So you _can_ do permission checks at
> read/write time, but you have to use the credentials of the opener,
> not "current".
>
> So in this case, I guess you could use
>
>          pm.show_pfn = file_ns_capable(file, &init_user_ns, CAP_SYS_ADMIN);
>
> if you really need to do this at read time, and cannot fill in that
> "show_pfn" at open-time.
>
>                          Linus
>


-- 
Konstantin

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 1/3] pagemap: add mmap-exclusive bit for marking pages mapped only here
  2015-05-12 12:05   ` Mark Williamson
@ 2015-05-13 10:51     ` Konstantin Khlebnikov
  2015-05-14 18:50       ` Mark Williamson
  0 siblings, 1 reply; 18+ messages in thread
From: Konstantin Khlebnikov @ 2015-05-13 10:51 UTC (permalink / raw
  To: Mark Williamson
  Cc: linux-mm, Naoya Horiguchi, kernel list, Andrew Morton,
	Pavel Emelyanov, Linux API, Andy Lutomirski, Vlastimil Babka,
	Pavel Machek, Mark Seaborn, Kirill A. Shutemov, Linus Torvalds,
	Daniel James, Finn Grimwood

On 12.05.2015 15:05, Mark Williamson wrote:
> Hi Konstantin,
>
> I hope you won't mind me thinking out loud here on the idea of adding
> a flag to the v2 pagemap fields...  From a kernel PoV, I agree that
> this seems like the cleanest approach.  However, with my application
> developer hat on:
>
>   1. I was hoping we'd be able to backport a compatible fix to older
> kernels that might adopt the pagemap permissions change.  Using the V2
> format flags rules out doing this for kernels that are too old to have
> soft-dirty, I think.
 >
>   2. From our software's PoV, I feel it's worth noting that it doesn't
> strictly fix ABI compatibility, though I realise that's probably not
> your primary concern here.  We'll need to modify our code to write the
> clear_refs file but that change is OK for us if it's the preferred
> solution.
>
> In the patches I've been playing with, I was considering putting the
> Exclusive flag in the now-unused PFN field of the pagemap entries.
> Since we're specifically trying to work around for the lack of PFN
> information, would there be any appetite for mirroring this flag
> unconditionally into the now-empty PFN field (i.e. whether using v1 or
> v2 flags) when accessed by an unprivileged process?
>
> I realise it's ugly from a kernel PoV and I feel a little bad for
> suggesting it - but it would address points 1 and 2 for us (our
> existing code just looks for changes in the pagemap entry, so sticking
> the flag in there would cause it to do the right thing).
>
> I'm sorry to raise application-specific issues at this point; I
> appreciate that your primary concern is to improve the kernel and
> technically I like the approach that you've taken!  I'll try and
> provide more code-oriented feedback once I've tried out the changes.

I prefer to backport v2 format (except soft-dirty bit and clear_refs)
into older kernels. Page-shift bits are barely used so nobody will see
the difference.

-- 
Konstantin

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 1/3] pagemap: add mmap-exclusive bit for marking pages mapped only here
  2015-05-12 10:40   ` Kirill A. Shutemov
@ 2015-05-13 10:59     ` Konstantin Khlebnikov
  0 siblings, 0 replies; 18+ messages in thread
From: Konstantin Khlebnikov @ 2015-05-13 10:59 UTC (permalink / raw
  To: Kirill A. Shutemov
  Cc: linux-mm, Naoya Horiguchi, linux-kernel, Andrew Morton,
	Mark Williamson, Pavel Emelyanov, linux-api, Andy Lutomirski,
	Vlastimil Babka, Pavel Machek, Mark Seaborn, Linus Torvalds,
	Daniel James, Finn Grimwood

On 12.05.2015 13:40, Kirill A. Shutemov wrote:
> On Tue, May 12, 2015 at 12:43:03PM +0300, Konstantin Khlebnikov wrote:
>> This patch sets bit 56 in pagemap if this page is mapped only once.
>> It allows to detect exclusively used pages without exposing PFN:
>>
>> present file exclusive state
>> 0       0    0         non-present
>> 1       1    0         file page mapped somewhere else
>> 1       1    1         file page mapped only here
>> 1       0    0         anon non-CoWed page (shared with parent/child)
>> 1       0    1         anon CoWed page (or never forked)
>
> Probably, worth noting that file-private pages are anon in this context.
>

You mean there's another kind of CoW pages? Yep, but from the kernel
point of view these pages are the same. Anyway Userspace could look
into /proc/*/maps and see is there any file beyond anon vma.

-- 
Konstantin

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 3/3] pagemap: switch to the new format and do some cleanup
  2015-05-12 10:54   ` Kirill A. Shutemov
@ 2015-05-13 11:39     ` Konstantin Khlebnikov
  0 siblings, 0 replies; 18+ messages in thread
From: Konstantin Khlebnikov @ 2015-05-13 11:39 UTC (permalink / raw
  To: Kirill A. Shutemov
  Cc: linux-mm, Naoya Horiguchi, linux-kernel, Andrew Morton,
	Mark Williamson, Pavel Emelyanov, linux-api, Andy Lutomirski,
	Vlastimil Babka, Pavel Machek, Mark Seaborn, Linus Torvalds,
	Daniel James, Finn Grimwood

On 12.05.2015 13:54, Kirill A. Shutemov wrote:
> On Tue, May 12, 2015 at 12:43:06PM +0300, Konstantin Khlebnikov wrote:
>> This patch removes page-shift bits (scheduled to remove since 3.11) and
>> completes migration to the new bit layout. Also it cleans messy macro.
>>
>> Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
>> ---
>>   fs/proc/task_mmu.c    |  152 ++++++++++++++++---------------------------------
>>   tools/vm/page-types.c |   29 +++------
>>   2 files changed, 58 insertions(+), 123 deletions(-)
>>
>> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
>> index 0b7a8ffec95f..66bc7207ce90 100644
>> --- a/fs/proc/task_mmu.c
>> +++ b/fs/proc/task_mmu.c
>> @@ -710,23 +710,6 @@ const struct file_operations proc_tid_smaps_operations = {
>>   	.release	= proc_map_release,
>>   };
>>
>> -/*
>> - * We do not want to have constant page-shift bits sitting in
>> - * pagemap entries and are about to reuse them some time soon.
>> - *
>> - * Here's the "migration strategy":
>> - * 1. when the system boots these bits remain what they are,
>> - *    but a warning about future change is printed in log;
>> - * 2. once anyone clears soft-dirty bits via clear_refs file,
>> - *    these flag is set to denote, that user is aware of the
>> - *    new API and those page-shift bits change their meaning.
>> - *    The respective warning is printed in dmesg;
>> - * 3. In a couple of releases we will remove all the mentions
>> - *    of page-shift in pagemap entries.
>> - */
>
> Wouldn't it be better to just have v2=1 by default for couple releases to
> see if anything breaks? This way we can revert easily if regression reported.
> I guess someone could miss this change coming if he didn't touch clear_refs.
>

I don't believe that constant PAGE_SHIFT bits are used by anybody. 
Recent change of permissions was much more destructive and there is just
one report about that. Kernel prints message at first pagemap open for
ten releases. I think that's enough.

-- 
Konstantin

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH RFC 0/3] pagemap: make useable for non-privilege users
  2015-05-12 11:13 ` [PATCH RFC 0/3] pagemap: make useable for non-privilege users Mark Williamson
@ 2015-05-14 18:40   ` Mark Williamson
  2015-06-08 12:53     ` Mark Williamson
  0 siblings, 1 reply; 18+ messages in thread
From: Mark Williamson @ 2015-05-14 18:40 UTC (permalink / raw
  To: Konstantin Khlebnikov
  Cc: linux-mm, Naoya Horiguchi, kernel list, Andrew Morton,
	Pavel Emelyanov, Linux API, Andy Lutomirski, Vlastimil Babka,
	Pavel Machek, Mark Seaborn, Kirill A. Shutemov, Linus Torvalds,
	Daniel James, Finn Grimwood

Hi Konstantin,

I modified our code to check for the map-exclusive flag where it used
to compare pageframe numbers.  First tests look pretty promising, so
this patch looks like a viable approach for us.

Is there anything further we can do to help?

Thanks,
Mark

On Tue, May 12, 2015 at 12:13 PM, Mark Williamson
<mwilliamson@undo-software.com> wrote:
> Hi Konstantin,
>
> Thanks very much for continuing to look at this!  It's very much
> appreciated.  I've been investigating from our end but got caught up
> in some gnarly details of our pagemap-consuming code.
>
> I like the approach and it seems like the information you're exposing
> will be useful for our application.  I'll test the patch and see if it
> works for us as-is.
>
> Will follow up with any comments on the individual patches.
>
> Thanks,
> Mark
>
> On Tue, May 12, 2015 at 10:43 AM, Konstantin Khlebnikov
> <khlebnikov@yandex-team.ru> wrote:
>> This patchset tries to make pagemap useable again in the safe way.
>> First patch adds bit 'map-exlusive' which is set if page is mapped only here.
>> Second patch restores access for non-privileged users but hides pfn if task
>> has no capability CAP_SYS_ADMIN. Third patch removes page-shift bits and
>> completes migration to the new pagemap format (flags soft-dirty and
>> mmap-exlusive are available only in the new format).
>>
>> ---
>>
>> Konstantin Khlebnikov (3):
>>       pagemap: add mmap-exclusive bit for marking pages mapped only here
>>       pagemap: hide physical addresses from non-privileged users
>>       pagemap: switch to the new format and do some cleanup
>>
>>
>>  Documentation/vm/pagemap.txt |    3 -
>>  fs/proc/task_mmu.c           |  178 +++++++++++++++++-------------------------
>>  tools/vm/page-types.c        |   35 ++++----
>>  3 files changed, 91 insertions(+), 125 deletions(-)

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 1/3] pagemap: add mmap-exclusive bit for marking pages mapped only here
  2015-05-13 10:51     ` Konstantin Khlebnikov
@ 2015-05-14 18:50       ` Mark Williamson
  2015-05-15  9:39         ` Konstantin Khlebnikov
  0 siblings, 1 reply; 18+ messages in thread
From: Mark Williamson @ 2015-05-14 18:50 UTC (permalink / raw
  To: Konstantin Khlebnikov
  Cc: linux-mm, Naoya Horiguchi, kernel list, Andrew Morton,
	Pavel Emelyanov, Linux API, Andy Lutomirski, Vlastimil Babka,
	Pavel Machek, Mark Seaborn, Kirill A. Shutemov, Linus Torvalds,
	Daniel James, Finn Grimwood

Hi Konstantin,

On Wed, May 13, 2015 at 11:51 AM, Konstantin Khlebnikov
<khlebnikov@yandex-team.ru> wrote:
> On 12.05.2015 15:05, Mark Williamson wrote:
<snip>
>>   1. I was hoping we'd be able to backport a compatible fix to older
>> kernels that might adopt the pagemap permissions change.  Using the V2
>> format flags rules out doing this for kernels that are too old to have
>> soft-dirty, I think.
>>
>>   2. From our software's PoV, I feel it's worth noting that it doesn't
>> strictly fix ABI compatibility, though I realise that's probably not
>> your primary concern here.  We'll need to modify our code to write the
>> clear_refs file but that change is OK for us if it's the preferred
>> solution.
<snip>
> I prefer to backport v2 format (except soft-dirty bit and clear_refs)
> into older kernels. Page-shift bits are barely used so nobody will see
> the difference.

My concern was whether a change to format would be acceptable to
include in the various -stable kernels; they are already including the
additional protections on pagemap, so we're starting to need our
fallback mode in distributions.  Do you think that such a patch would
be acceptable there?

(As an application vendor we're likely to be particularly stuck with
what the commercial distributions decide to ship, which is why I'm
trying to keep an eye on this)

I appreciate that this is a slightly administrative concern!  I
definitely like the technical approach of this code and it seems to
work fine for us.

Thanks,
Mark

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 1/3] pagemap: add mmap-exclusive bit for marking pages mapped only here
  2015-05-14 18:50       ` Mark Williamson
@ 2015-05-15  9:39         ` Konstantin Khlebnikov
  0 siblings, 0 replies; 18+ messages in thread
From: Konstantin Khlebnikov @ 2015-05-15  9:39 UTC (permalink / raw
  To: Mark Williamson
  Cc: linux-mm, Naoya Horiguchi, kernel list, Andrew Morton,
	Pavel Emelyanov, Linux API, Andy Lutomirski, Vlastimil Babka,
	Pavel Machek, Mark Seaborn, Kirill A. Shutemov, Linus Torvalds,
	Daniel James, Finn Grimwood

On 14.05.2015 21:50, Mark Williamson wrote:
> Hi Konstantin,
>
> On Wed, May 13, 2015 at 11:51 AM, Konstantin Khlebnikov
> <khlebnikov@yandex-team.ru> wrote:
>> On 12.05.2015 15:05, Mark Williamson wrote:
> <snip>
>>>    1. I was hoping we'd be able to backport a compatible fix to older
>>> kernels that might adopt the pagemap permissions change.  Using the V2
>>> format flags rules out doing this for kernels that are too old to have
>>> soft-dirty, I think.
>>>
>>>    2. From our software's PoV, I feel it's worth noting that it doesn't
>>> strictly fix ABI compatibility, though I realise that's probably not
>>> your primary concern here.  We'll need to modify our code to write the
>>> clear_refs file but that change is OK for us if it's the preferred
>>> solution.
> <snip>
>> I prefer to backport v2 format (except soft-dirty bit and clear_refs)
>> into older kernels. Page-shift bits are barely used so nobody will see
>> the difference.
>
> My concern was whether a change to format would be acceptable to
> include in the various -stable kernels; they are already including the
> additional protections on pagemap, so we're starting to need our
> fallback mode in distributions.  Do you think that such a patch would
> be acceptable there?
>
> (As an application vendor we're likely to be particularly stuck with
> what the commercial distributions decide to ship, which is why I'm
> trying to keep an eye on this)
>
> I appreciate that this is a slightly administrative concern!  I
> definitely like the technical approach of this code and it seems to
> work fine for us.

I cannot guarantee that v2 format will be accepted into stable kernels
and into distributives. I'm not the gate keeper.

As a fallback probably you should invent some kind of suid helper
which gives you access to required information without exposing pfn.
For example: it gets pids and memory ranges as arguments and prints
bitmap of CoWed pages into stdout.

-- 
Konstantin

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH RFC 0/3] pagemap: make useable for non-privilege users
  2015-05-14 18:40   ` Mark Williamson
@ 2015-06-08 12:53     ` Mark Williamson
  0 siblings, 0 replies; 18+ messages in thread
From: Mark Williamson @ 2015-06-08 12:53 UTC (permalink / raw
  To: Konstantin Khlebnikov
  Cc: linux-mm, Naoya Horiguchi, kernel list, Andrew Morton,
	Pavel Emelyanov, Linux API, Andy Lutomirski, Vlastimil Babka,
	Pavel Machek, Mark Seaborn, Kirill A. Shutemov, Linus Torvalds,
	Daniel James, Finn Grimwood

Hi Konstantin,

Would you still be intending to re-submit this patch?  We'd be quite
keen to assist, so if there's anything further I can do please let me
know!

Just to re-confirm - we do think that the patch will solve our problem
(relatively minor changes required on our side).

Thanks,
Mark

On Thu, May 14, 2015 at 7:40 PM, Mark Williamson
<mwilliamson@undo-software.com> wrote:
> Hi Konstantin,
>
> I modified our code to check for the map-exclusive flag where it used
> to compare pageframe numbers.  First tests look pretty promising, so
> this patch looks like a viable approach for us.
>
> Is there anything further we can do to help?
>
> Thanks,
> Mark
>
> On Tue, May 12, 2015 at 12:13 PM, Mark Williamson
> <mwilliamson@undo-software.com> wrote:
>> Hi Konstantin,
>>
>> Thanks very much for continuing to look at this!  It's very much
>> appreciated.  I've been investigating from our end but got caught up
>> in some gnarly details of our pagemap-consuming code.
>>
>> I like the approach and it seems like the information you're exposing
>> will be useful for our application.  I'll test the patch and see if it
>> works for us as-is.
>>
>> Will follow up with any comments on the individual patches.
>>
>> Thanks,
>> Mark
>>
>> On Tue, May 12, 2015 at 10:43 AM, Konstantin Khlebnikov
>> <khlebnikov@yandex-team.ru> wrote:
>>> This patchset tries to make pagemap useable again in the safe way.
>>> First patch adds bit 'map-exlusive' which is set if page is mapped only here.
>>> Second patch restores access for non-privileged users but hides pfn if task
>>> has no capability CAP_SYS_ADMIN. Third patch removes page-shift bits and
>>> completes migration to the new pagemap format (flags soft-dirty and
>>> mmap-exlusive are available only in the new format).
>>>
>>> ---
>>>
>>> Konstantin Khlebnikov (3):
>>>       pagemap: add mmap-exclusive bit for marking pages mapped only here
>>>       pagemap: hide physical addresses from non-privileged users
>>>       pagemap: switch to the new format and do some cleanup
>>>
>>>
>>>  Documentation/vm/pagemap.txt |    3 -
>>>  fs/proc/task_mmu.c           |  178 +++++++++++++++++-------------------------
>>>  tools/vm/page-types.c        |   35 ++++----
>>>  3 files changed, 91 insertions(+), 125 deletions(-)

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2015-06-08 12:53 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2015-05-12  9:43 [PATCH RFC 0/3] pagemap: make useable for non-privilege users Konstantin Khlebnikov
2015-05-12  9:43 ` [PATCH v2 1/3] pagemap: add mmap-exclusive bit for marking pages mapped only here Konstantin Khlebnikov
2015-05-12 10:40   ` Kirill A. Shutemov
2015-05-13 10:59     ` Konstantin Khlebnikov
2015-05-12 12:05   ` Mark Williamson
2015-05-13 10:51     ` Konstantin Khlebnikov
2015-05-14 18:50       ` Mark Williamson
2015-05-15  9:39         ` Konstantin Khlebnikov
2015-05-12  9:43 ` [PATCH v2 2/3] pagemap: hide physical addresses from non-privileged users Konstantin Khlebnikov
2015-05-12 11:22   ` Mark Williamson
2015-05-12 15:06   ` Linus Torvalds
2015-05-12 15:41     ` Konstantin Khlebnikov
2015-05-12  9:43 ` [PATCH v2 3/3] pagemap: switch to the new format and do some cleanup Konstantin Khlebnikov
2015-05-12 10:54   ` Kirill A. Shutemov
2015-05-13 11:39     ` Konstantin Khlebnikov
2015-05-12 11:13 ` [PATCH RFC 0/3] pagemap: make useable for non-privilege users Mark Williamson
2015-05-14 18:40   ` Mark Williamson
2015-06-08 12:53     ` Mark Williamson

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).