* [PATCH V3] xen: remove some memory limits from pv-domains
@ 2014-09-17 14:59 Juergen Gross
  2014-09-17 14:59 ` [PATCH V3] xen: eliminate scalability issues from initial mapping setup Juergen Gross
From: Juergen Gross @ 2014-09-17 14:59 UTC
  To: linux-kernel, xen-devel, konrad.wilk, boris.ostrovsky,
	david.vrabel, jbeulich
  Cc: Juergen Gross

When a Xen pv-domain is booted, the initial memory map contains multiple
objects in the top 2 GB of virtual address space, including the initrd
and the p2m list. This limits the maximum supported size of the initrd,
and the memory the p2m list can span is limited to about 500 GB.

Xen, however, supports loading the initrd without mapping it, and the
initial p2m list can be mapped by Xen to an arbitrarily selected virtual
address. The following patches activate those options and thus remove
the limitations.

It should be noted that the p2m list limitation doesn't only affect the
amount of memory a pv-domain can use: it also prevents Dom0 from being
started on physical systems with larger memory without reducing its
memory via a Xen boot parameter. By mapping the initial p2m list to
an area outside the top 2 GB it is now possible to boot Dom0 on such
systems.

It would be desirable to be able to use more than 512 GB in a pv-domain,
but this would require a reorganization of the p2m tree built by the
kernel at boot time. As this reorganization would affect the Xen tools
and kexec, too, it is not included in this patch set. This topic can be
addressed later.

Changes since V2:
- 2 patches dropped, as they are already applied
- added helper function and updated comment and patch description as
  requested by David Vrabel

Juergen Gross (1):
  xen: eliminate scalability issues from initial mapping setup

 arch/x86/xen/mmu.c      | 119 +++++++++++++++++++++++++++++++++++++++++++++---
 arch/x86/xen/setup.c    |  65 ++++++++++++++------------
 arch/x86/xen/xen-head.S |   2 +
 3 files changed, 151 insertions(+), 35 deletions(-)

-- 
1.8.4.5



* [PATCH V3] xen: eliminate scalability issues from initial mapping setup
  2014-09-17 14:59 [PATCH V3] xen: remove some memory limits from pv-domains Juergen Gross
@ 2014-09-17 14:59 ` Juergen Gross
  2014-09-23  3:58   ` Juergen Gross
  2014-09-24 13:20     ` David Vrabel
From: Juergen Gross @ 2014-09-17 14:59 UTC
  To: linux-kernel, xen-devel, konrad.wilk, boris.ostrovsky,
	david.vrabel, jbeulich
  Cc: Juergen Gross

Direct Xen to place the initial P->M table outside of the initial
mapping, as otherwise the 1G (implementation) / 2G (theoretical)
restriction on the size of the initial mapping limits the amount
of memory a domain can be handed initially.

As the initial P->M table is copied rather early during boot to
domain private memory and its initial virtual mapping is dropped,
the easiest way to avoid virtual address conflicts with other
addresses in the kernel is to use a user address area for the
virtual address of the initial P->M table. This allows us to just
throw away the page tables of the initial mapping after the copy
without having to care about address invalidation.

It should be noted that this patch won't enable a pv-domain to USE
more than 512 GB of RAM. It just enables it to be started with a
P->M table covering more memory. This is especially important for
being able to boot a Dom0 on a system with more than 512 GB of memory.

Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
 arch/x86/xen/mmu.c      | 119 +++++++++++++++++++++++++++++++++++++++++++++---
 arch/x86/xen/setup.c    |  65 ++++++++++++++------------
 arch/x86/xen/xen-head.S |   2 +
 3 files changed, 151 insertions(+), 35 deletions(-)

diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
index 16fb009..3bd403b 100644
--- a/arch/x86/xen/mmu.c
+++ b/arch/x86/xen/mmu.c
@@ -1198,6 +1198,78 @@ static void __init xen_cleanhighmap(unsigned long vaddr,
 	 * instead of somewhere later and be confusing. */
 	xen_mc_flush();
 }
+
+/*
+ * Make a page range writeable and free it.
+ */
+static void __init xen_free_ro_pages(unsigned long paddr, unsigned long size)
+{
+	void *vaddr = __va(paddr);
+	void *vaddr_end = vaddr + size;
+
+	for (; vaddr < vaddr_end; vaddr += PAGE_SIZE)
+		make_lowmem_page_readwrite(vaddr);
+
+	memblock_free(paddr, size);
+}
+
+static void xen_cleanmfnmap_free_pgtbl(void *pgtbl)
+{
+	unsigned long pa = __pa(pgtbl) & PHYSICAL_PAGE_MASK;
+
+	ClearPagePinned(virt_to_page(__va(pa)));
+	xen_free_ro_pages(pa, PAGE_SIZE);
+}
+
+/*
+ * Since it is well isolated we can (and since it is perhaps large we should)
+ * also free the page tables mapping the initial P->M table.
+ */
+static void __init xen_cleanmfnmap(unsigned long vaddr)
+{
+	unsigned long va = vaddr & PMD_MASK;
+	unsigned long pa;
+	pgd_t *pgd = pgd_offset_k(va);
+	pud_t *pud_page = pud_offset(pgd, 0);
+	pud_t *pud;
+	pmd_t *pmd;
+	pte_t *pte;
+	unsigned int i;
+
+	set_pgd(pgd, __pgd(0));
+	do {
+		pud = pud_page + pud_index(va);
+		if (pud_none(*pud)) {
+			va += PUD_SIZE;
+		} else if (pud_large(*pud)) {
+			pa = pud_val(*pud) & PHYSICAL_PAGE_MASK;
+			xen_free_ro_pages(pa, PUD_SIZE);
+			va += PUD_SIZE;
+		} else {
+			pmd = pmd_offset(pud, va);
+			if (pmd_large(*pmd)) {
+				pa = pmd_val(*pmd) & PHYSICAL_PAGE_MASK;
+				xen_free_ro_pages(pa, PMD_SIZE);
+			} else if (!pmd_none(*pmd)) {
+				pte = pte_offset_kernel(pmd, va);
+				for (i = 0; i < PTRS_PER_PTE; ++i) {
+					if (pte_none(pte[i]))
+						break;
+					pa = pte_pfn(pte[i]) << PAGE_SHIFT;
+					xen_free_ro_pages(pa, PAGE_SIZE);
+				}
+				xen_cleanmfnmap_free_pgtbl(pte);
+			}
+			va += PMD_SIZE;
+			if (pmd_index(va))
+				continue;
+			xen_cleanmfnmap_free_pgtbl(pmd);
+		}
+
+	} while (pud_index(va) || pmd_index(va));
+	xen_cleanmfnmap_free_pgtbl(pud_page);
+}
+
 static void __init xen_pagetable_p2m_copy(void)
 {
 	unsigned long size;
@@ -1217,18 +1289,23 @@ static void __init xen_pagetable_p2m_copy(void)
 	/* using __ka address and sticking INVALID_P2M_ENTRY! */
 	memset((void *)xen_start_info->mfn_list, 0xff, size);
 
-	/* We should be in __ka space. */
-	BUG_ON(xen_start_info->mfn_list < __START_KERNEL_map);
 	addr = xen_start_info->mfn_list;
-	/* We roundup to the PMD, which means that if anybody at this stage is
+	/* We could be in __ka space.
+	 * We roundup to the PMD, which means that if anybody at this stage is
 	 * using the __ka address of xen_start_info or xen_start_info->shared_info
 	 * they are in going to crash. Fortunatly we have already revectored
 	 * in xen_setup_kernel_pagetable and in xen_setup_shared_info. */
 	size = roundup(size, PMD_SIZE);
-	xen_cleanhighmap(addr, addr + size);
 
-	size = PAGE_ALIGN(xen_start_info->nr_pages * sizeof(unsigned long));
-	memblock_free(__pa(xen_start_info->mfn_list), size);
+	if (addr >= __START_KERNEL_map) {
+		xen_cleanhighmap(addr, addr + size);
+		size = PAGE_ALIGN(xen_start_info->nr_pages *
+				  sizeof(unsigned long));
+		memblock_free(__pa(addr), size);
+	} else {
+		xen_cleanmfnmap(addr);
+	}
+
 	/* And revector! Bye bye old array */
 	xen_start_info->mfn_list = new_mfn_list;
 
@@ -1529,6 +1606,24 @@ static pte_t __init mask_rw_pte(pte_t *ptep, pte_t pte)
 #else /* CONFIG_X86_64 */
 static pte_t __init mask_rw_pte(pte_t *ptep, pte_t pte)
 {
+	unsigned long pfn;
+
+	if (xen_feature(XENFEAT_writable_page_tables) ||
+	    xen_feature(XENFEAT_auto_translated_physmap) ||
+	    xen_start_info->mfn_list >= __START_KERNEL_map)
+		return pte;
+
+	/*
+	 * Pages belonging to the initial p2m list mapped outside the default
+	 * address range must be mapped read-only. This region contains the
+	 * page tables for mapping the p2m list, too, and page tables MUST be
+	 * mapped read-only.
+	 */
+	pfn = pte_pfn(pte);
+	if (pfn >= xen_start_info->first_p2m_pfn &&
+	    pfn < xen_start_info->first_p2m_pfn + xen_start_info->nr_p2m_frames)
+		pte = __pte_ma(pte_val_ma(pte) & ~_PAGE_RW);
+
 	return pte;
 }
 #endif /* CONFIG_X86_64 */
@@ -1884,7 +1979,10 @@ void __init xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn)
 	 * mappings. Considering that on Xen after the kernel mappings we
 	 * have the mappings of some pages that don't exist in pfn space, we
 	 * set max_pfn_mapped to the last real pfn mapped. */
-	max_pfn_mapped = PFN_DOWN(__pa(xen_start_info->mfn_list));
+	if (xen_start_info->mfn_list < __START_KERNEL_map)
+		max_pfn_mapped = xen_start_info->first_p2m_pfn;
+	else
+		max_pfn_mapped = PFN_DOWN(__pa(xen_start_info->mfn_list));
 
 	pt_base = PFN_DOWN(__pa(xen_start_info->pt_base));
 	pt_end = pt_base + xen_start_info->nr_pt_frames;
@@ -1924,6 +2022,11 @@ void __init xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn)
 	/* Graft it onto L4[511][510] */
 	copy_page(level2_kernel_pgt, l2);
 
+	/* Copy the initial P->M table mappings if necessary. */
+	i = pgd_index(xen_start_info->mfn_list);
+	if (i && i < pgd_index(__START_KERNEL_map))
+		init_level4_pgt[i] = ((pgd_t *)xen_start_info->pt_base)[i];
+
 	if (!xen_feature(XENFEAT_auto_translated_physmap)) {
 		/* Make pagetable pieces RO */
 		set_page_prot(init_level4_pgt, PAGE_KERNEL_RO);
@@ -1964,6 +2067,8 @@ void __init xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn)
 
 	/* Our (by three pages) smaller Xen pagetable that we are using */
 	memblock_reserve(PFN_PHYS(pt_base), (pt_end - pt_base) * PAGE_SIZE);
+	/* protect xen_start_info */
+	memblock_reserve(__pa(xen_start_info), PAGE_SIZE);
 	/* Revector the xen_start_info */
 	xen_start_info = (struct start_info *)__va(__pa(xen_start_info));
 }
diff --git a/arch/x86/xen/setup.c b/arch/x86/xen/setup.c
index 2e555163..6412367 100644
--- a/arch/x86/xen/setup.c
+++ b/arch/x86/xen/setup.c
@@ -333,6 +333,41 @@ void xen_ignore_unusable(struct e820entry *list, size_t map_size)
 	}
 }
 
+/*
+ * Reserve Xen mfn_list.
+ * See comment above "struct start_info" in <xen/interface/xen.h>
+ * We tried to make the memblock_reserve more selective so
+ * that it would be clear what region is reserved. Sadly we ran
+ * into the problem wherein on a 64-bit hypervisor with a 32-bit
+ * initial domain, the pt_base has the cr3 value which is not
+ * necessarily where the pagetable starts! As Jan put it: "
+ * Actually, the adjustment turns out to be correct: The page
+ * tables for a 32-on-64 dom0 get allocated in the order "first L1",
+ * "first L2", "first L3", so the offset to the page table base is
+ * indeed 2. When reading xen/include/public/xen.h's comment
+ * very strictly, this is not a violation (since there nothing is said
+ * that the first thing in the page table space is pointed to by
+ * pt_base; I admit that this seems to be implied though, namely
+ * do I think that it is implied that the page table space is the
+ * range [pt_base, pt_base + nt_pt_frames), whereas that
+ * range here indeed is [pt_base - 2, pt_base - 2 + nt_pt_frames),
+ * which - without a priori knowledge - the kernel would have
+ * difficulty to figure out)." - so let's just fall back to the
+ * easy way and reserve the whole region.
+ */
+static void __init xen_reserve_xen_mfnlist(void)
+{
+	if (xen_start_info->mfn_list >= __START_KERNEL_map) {
+		memblock_reserve(__pa(xen_start_info->mfn_list),
+				 xen_start_info->pt_base -
+				 xen_start_info->mfn_list);
+		return;
+	}
+
+	memblock_reserve(PFN_PHYS(xen_start_info->first_p2m_pfn),
+			 PFN_PHYS(xen_start_info->nr_p2m_frames));
+}
+
 /**
  * machine_specific_memory_setup - Hook for machine specific memory setup.
  **/
@@ -467,32 +502,7 @@ char * __init xen_memory_setup(void)
 	e820_add_region(ISA_START_ADDRESS, ISA_END_ADDRESS - ISA_START_ADDRESS,
 			E820_RESERVED);
 
-	/*
-	 * Reserve Xen bits:
-	 *  - mfn_list
-	 *  - xen_start_info
-	 * See comment above "struct start_info" in <xen/interface/xen.h>
-	 * We tried to make the the memblock_reserve more selective so
-	 * that it would be clear what region is reserved. Sadly we ran
-	 * in the problem wherein on a 64-bit hypervisor with a 32-bit
-	 * initial domain, the pt_base has the cr3 value which is not
-	 * neccessarily where the pagetable starts! As Jan put it: "
-	 * Actually, the adjustment turns out to be correct: The page
-	 * tables for a 32-on-64 dom0 get allocated in the order "first L1",
-	 * "first L2", "first L3", so the offset to the page table base is
-	 * indeed 2. When reading xen/include/public/xen.h's comment
-	 * very strictly, this is not a violation (since there nothing is said
-	 * that the first thing in the page table space is pointed to by
-	 * pt_base; I admit that this seems to be implied though, namely
-	 * do I think that it is implied that the page table space is the
-	 * range [pt_base, pt_base + nt_pt_frames), whereas that
-	 * range here indeed is [pt_base - 2, pt_base - 2 + nt_pt_frames),
-	 * which - without a priori knowledge - the kernel would have
-	 * difficulty to figure out)." - so lets just fall back to the
-	 * easy way and reserve the whole region.
-	 */
-	memblock_reserve(__pa(xen_start_info->mfn_list),
-			 xen_start_info->pt_base - xen_start_info->mfn_list);
+	xen_reserve_xen_mfnlist();
 
 	sanitize_e820_map(e820.map, ARRAY_SIZE(e820.map), &e820.nr_map);
 
@@ -522,8 +532,7 @@ char * __init xen_auto_xlated_memory_setup(void)
 	for (i = 0; i < memmap.nr_entries; i++)
 		e820_add_region(map[i].addr, map[i].size, map[i].type);
 
-	memblock_reserve(__pa(xen_start_info->mfn_list),
-			 xen_start_info->pt_base - xen_start_info->mfn_list);
+	xen_reserve_xen_mfnlist();
 
 	return "Xen";
 }
diff --git a/arch/x86/xen/xen-head.S b/arch/x86/xen/xen-head.S
index 46408e5..e7bd668 100644
--- a/arch/x86/xen/xen-head.S
+++ b/arch/x86/xen/xen-head.S
@@ -112,6 +112,8 @@ NEXT_HYPERCALL(arch_6)
 	ELFNOTE(Xen, XEN_ELFNOTE_VIRT_BASE,      _ASM_PTR __PAGE_OFFSET)
 #else
 	ELFNOTE(Xen, XEN_ELFNOTE_VIRT_BASE,      _ASM_PTR __START_KERNEL_map)
+	/* Map the p2m table to a 512GB-aligned user address. */
+	ELFNOTE(Xen, XEN_ELFNOTE_INIT_P2M,       .quad PGDIR_SIZE)
 #endif
 	ELFNOTE(Xen, XEN_ELFNOTE_ENTRY,          _ASM_PTR startup_xen)
 	ELFNOTE(Xen, XEN_ELFNOTE_HYPERCALL_PAGE, _ASM_PTR hypercall_page)
-- 
1.8.4.5



* Re: [PATCH V3] xen: eliminate scalability issues from initial mapping setup
  2014-09-17 14:59 ` [PATCH V3] xen: eliminate scalability issues from initial mapping setup Juergen Gross
@ 2014-09-23  3:58   ` Juergen Gross
  2014-09-23 13:10       ` David Vrabel
  2014-09-24 13:20     ` David Vrabel
From: Juergen Gross @ 2014-09-23  3:58 UTC
  To: linux-kernel, xen-devel, konrad.wilk, boris.ostrovsky,
	david.vrabel, jbeulich

On 09/17/2014 04:59 PM, Juergen Gross wrote:
> Direct Xen to place the initial P->M table outside of the initial
> mapping, as otherwise the 1G (implementation) / 2G (theoretical)
> restriction on the size of the initial mapping limits the amount
> of memory a domain can be handed initially.
>
> As the initial P->M table is copied rather early during boot to
> domain private memory and it's initial virtual mapping is dropped,
> the easiest way to avoid virtual address conflicts with other
> addresses in the kernel is to use a user address area for the
> virtual address of the initial P->M table. This allows us to just
> throw away the page tables of the initial mapping after the copy
> without having to care about address invalidation.
>
> It should be noted that this patch won't enable a pv-domain to USE
> more than 512 GB of RAM. It just enables it to be started with a
> P->M table covering more memory. This is especially important for
> being able to boot a Dom0 on a system with more than 512 GB memory.
>
> Signed-off-by: Juergen Gross <jgross@suse.com>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Any Acks/Naks?

Juergen



* Re: [PATCH V3] xen: eliminate scalability issues from initial mapping setup
  2014-09-23  3:58   ` Juergen Gross
@ 2014-09-23 13:10       ` David Vrabel
From: David Vrabel @ 2014-09-23 13:10 UTC
  To: Juergen Gross, linux-kernel, xen-devel, konrad.wilk,
	boris.ostrovsky, jbeulich

On 23/09/14 04:58, Juergen Gross wrote:
> On 09/17/2014 04:59 PM, Juergen Gross wrote:
>> Direct Xen to place the initial P->M table outside of the initial
>> mapping, as otherwise the 1G (implementation) / 2G (theoretical)
>> restriction on the size of the initial mapping limits the amount
>> of memory a domain can be handed initially.
>>
>> As the initial P->M table is copied rather early during boot to
>> domain private memory and it's initial virtual mapping is dropped,
>> the easiest way to avoid virtual address conflicts with other
>> addresses in the kernel is to use a user address area for the
>> virtual address of the initial P->M table. This allows us to just
>> throw away the page tables of the initial mapping after the copy
>> without having to care about address invalidation.
>>
>> It should be noted that this patch won't enable a pv-domain to USE
>> more than 512 GB of RAM. It just enables it to be started with a
>> P->M table covering more memory. This is especially important for
>> being able to boot a Dom0 on a system with more than 512 GB memory.
>>
>> Signed-off-by: Juergen Gross <jgross@suse.com>
>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> 
> Any Acks/Naks?

Sorry for the delay.  We only have one machine with 1 TB of RAM and it's
very busy.

>> +static void xen_cleanmfnmap_free_pgtbl(void *pgtbl)

This needs to be __init.  But I'll fix this up.  In future, can you make
sure you build with the option that finds these
(CONFIG_DEBUG_SECTION_MISMATCH)?

David



* Re: [PATCH V3] xen: eliminate scalability issues from initial mapping setup
  2014-09-17 14:59 ` [PATCH V3] xen: eliminate scalability issues from initial mapping setup Juergen Gross
@ 2014-09-24 13:20     ` David Vrabel
From: David Vrabel @ 2014-09-24 13:20 UTC
  To: Juergen Gross, linux-kernel, xen-devel, konrad.wilk,
	boris.ostrovsky, jbeulich

On 17/09/14 15:59, Juergen Gross wrote:
> Direct Xen to place the initial P->M table outside of the initial
> mapping, as otherwise the 1G (implementation) / 2G (theoretical)
> restriction on the size of the initial mapping limits the amount
> of memory a domain can be handed initially.
> 
> As the initial P->M table is copied rather early during boot to
> domain private memory and it's initial virtual mapping is dropped,
> the easiest way to avoid virtual address conflicts with other
> addresses in the kernel is to use a user address area for the
> virtual address of the initial P->M table. This allows us to just
> throw away the page tables of the initial mapping after the copy
> without having to care about address invalidation.
> 
> It should be noted that this patch won't enable a pv-domain to USE
> more than 512 GB of RAM. It just enables it to be started with a
> P->M table covering more memory. This is especially important for
> being able to boot a Dom0 on a system with more than 512 GB memory.

This doesn't seem to work.  It crashes when attempting to construct
the page tables.  Have these patches been tested on a host with > 512 GiB?

[    0.000000] Initializing cgroup subsys cpuset
[    0.000000] Initializing cgroup subsys cpu
[    0.000000] Initializing cgroup subsys cpuacct
[    0.000000] Linux version 3.17.0-rc6.davidvr (davidvr@qabil) (gcc version 4.4
[    0.000000] Command line: root=LABEL=root-kivexhrj ro hpet=disable console=tn
[    0.000000] KERNEL supported cpus:
[    0.000000]   Intel GenuineIntel
[    0.000000]   AMD AuthenticAMD
[    0.000000] Set 526888 page(s) to 1-1 mapping
[    0.000000] Remapped 526888 page(s), last_pfn=131598888
[    0.000000] Released 0 page(s)
[    0.000000] e820: BIOS-provided physical RAM map:
[    0.000000] Xen: [mem 0x0000000000000000-0x000000000009ffff] usable
[    0.000000] Xen: [mem 0x00000000000a0000-0x00000000000fffff] reserved
[    0.000000] Xen: [mem 0x0000000000100000-0x000000007f637fff] usable
[    0.000000] Xen: [mem 0x000000007f638000-0x000000007f64dfff] reserved
[    0.000000] Xen: [mem 0x000000007f64e000-0x000000007f6ccfff] ACPI data
[    0.000000] Xen: [mem 0x000000007f6cd000-0x000000008fffffff] reserved
[    0.000000] Xen: [mem 0x00000000ecff0000-0x00000000ecff1fff] reserved
[    0.000000] Xen: [mem 0x00000000fe000000-0x00000000ffffffff] reserved
[    0.000000] Xen: [mem 0x0000000100000000-0x0000007cffffffff] usable
[    0.000000] Xen: [mem 0x0000007d00000000-0x000001007fffffff] unusable
[    0.000000] bootconsole [xenboot0] enabled
[    0.000000] NX (Execute Disable) protection: active
[    0.000000] SMBIOS 2.6 present.
[    0.000000] AGP: No AGP bridge found
[    0.000000] e820: last_pfn = 0x7d00000 max_arch_pfn = 0x400000000
[    0.000000] e820: last_pfn = 0x7f638 max_arch_pfn = 0x400000000
[    0.000000] Scanning 1 areas for low memory corruption
[    0.000000] init_memory_mapping: [mem 0x00000000-0x000fffff]
[    0.000000] init_memory_mapping: [mem 0x7cffe00000-0x7cffffffff]
[    0.000000] init_memory_mapping: [mem 0x7cfc000000-0x7cffdfffff]
[    0.000000] init_memory_mapping: [mem 0x7c80000000-0x7cfbffffff]
[    0.000000] init_memory_mapping: [mem 0x7000000000-0x7c7fffffff]
[    0.000000] init_memory_mapping: [mem 0x00100000-0x7f637fff]
[    0.000000] init_memory_mapping: [mem 0x100000000-0x6fffffffff]
[    0.000000] RAMDISK: [mem 0x04000000-0x04856fff]
[    0.000000] ACPI: Early table checksum verification disabled
[    0.000000] ACPI: RSDP 0x00000000000F0A90 000024 (v02 DELL  )
[    0.000000] ACPI: XSDT 0x00000000000F0C54 000094 (v01 DELL   PE_SC3   000000)
[    0.000000] ACPI: FACP 0x000000007F68F588 0000F4 (v03 DELL   PE_SC3   000000)
[    0.000000] ACPI: DSDT 0x000000007F64E000 0055C3 (v01 DELL   PE_SC3   000000)
[    0.000000] ACPI: FACS 0x000000007F691000 000040
[    0.000000] ACPI: APIC 0x000000007F68E478 0002DE (v01 DELL   PE_SC3   000000)
[    0.000000] ACPI: SPCR 0x000000007F68E764 000050 (v01 DELL   PE_SC3   000000)
[    0.000000] ACPI: HPET 0x000000007F68E7B8 000038 (v01 DELL   PE_SC3   000000)
[    0.000000] ACPI: XMAR 0x000000007F68E7F4 0001C8 (v01 DELL   PE_SC3   000000)
[    0.000000] ACPI: MCFG 0x000000007F68EAE8 00003C (v01 DELL   PE_SC3   000000)
[    0.000000] ACPI: WD__ 0x000000007F68EB28 000134 (v01 DELL   PE_SC3   000000)
[    0.000000] ACPI: SLIC 0x000000007F68EC60 000024 (v01 DELL   PE_SC3   000000)
[    0.000000] ACPI: ERST 0x000000007F653744 000270 (v01 DELL   PE_SC3   000000)
[    0.000000] ACPI: HEST 0x000000007F6539B4 000514 (v01 DELL   PE_SC3   000000)
[    0.000000] ACPI: BERT 0x000000007F6535C4 000030 (v01 DELL   PE_SC3   000000)
[    0.000000] ACPI: EINJ 0x000000007F6535F4 000150 (v01 DELL   PE_SC3   000000)
[    0.000000] ACPI: SRAT 0x000000007F68EDE4 000738 (v01 DELL   PE_SC3   000000)
[    0.000000] ACPI: TCPA 0x000000007F68F520 000064 (v02 DELL   PE_SC3   000000)
[    0.000000] Zone ranges:
[    0.000000]   DMA      [mem 0x00001000-0x00ffffff]
[    0.000000]   DMA32    [mem 0x01000000-0xffffffff]
[    0.000000]   Normal   [mem 0x100000000-0x7cffffffff]
[    0.000000] Movable zone start for each node
[    0.000000] Early memory node ranges
[    0.000000]   node   0: [mem 0x00001000-0x0009ffff]
[    0.000000]   node   0: [mem 0x00100000-0x7f637fff]
[    0.000000]   node   0: [mem 0x100000000-0x7cffffffff]
[    0.000000] BUG: unable to handle kernel NULL pointer dereference at        )
[    0.000000] IP: [<ffffffff8100b7d4>] get_phys_to_machine+0x64/0x70
[    0.000000] PGD 0 
[    0.000000] Oops: 0000 [#1] SMP 
[    0.000000] Modules linked in:
[    0.000000] CPU: 0 PID: 0 Comm: swapper Not tainted 3.17.0-rc6.davidvr #1
[    0.000000] Hardware name: Dell Inc. PowerEdge R910/0P658H, BIOS 1.2.0 06/220
[    0.000000] task: ffffffff81a1a4a0 ti: ffffffff81a00000 task.ti: ffffffff81a0
[    0.000000] RIP: e030:[<ffffffff8100b7d4>]  [<ffffffff8100b7d4>] get_phys_to0
[    0.000000] RSP: e02b:ffffffff81a03d70  EFLAGS: 00010007
[    0.000000] RAX: 00000080003fc000 RBX: 001000806d0000e7 RCX: 00000000000001f4
[    0.000000] RDX: ffffffff820c2000 RSI: 000000000000005a RDI: 0000000007d0025a
[    0.000000] RBP: ffffffff81a03d70 R08: ffffffff81a03d94 R09: ffff880000000000
[    0.000000] R10: ffffffff81a03d90 R11: ffffff82fff7dfff R12: 000000000806d000
[    0.000000] R13: 0000000007d0025a R14: ffff880000000000 R15: ffff880044859ec0
[    0.000000] FS:  0000000000000000(0000) GS:ffffffff81ad8000(0000) knlGS:00000
[    0.000000] CS:  e033 DS: 0000 ES: 0000 CR0: 0000000080050033
[    0.000000] CR2: 0000000000000000 CR3: 0000000001a13000 CR4: 0000000000002660
[    0.000000] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[    0.000000] DR3: 0000000000000000 DR6: 0000000000000000 DR7: 0000000000000000
[    0.000000] Stack:
[    0.000000]  ffffffff81a03da0 ffffffff8100624f ffffffff81058bf7 000000807b000
[    0.000000]  00003ffffffff000 ffff887a4fce0000 ffffffff81a03db0 ffffffff8100e
[    0.000000]  ffffffff81a03e58 ffffffff810054c9 ffffff82fff7dfff ffffffff81a00
[    0.000000] Call Trace:
[    0.000000]  [<ffffffff8100624f>] pte_mfn_to_pfn+0x7f/0x100
[    0.000000]  [<ffffffff81058bf7>] ? lookup_address_in_pgd+0x27/0xf0
[    0.000000]  [<ffffffff8100a07e>] xen_pmd_val+0xe/0x10
[    0.000000]  [<ffffffff810054c9>] __raw_callee_save_xen_pmd_val+0x11/0x1e
[    0.000000]  [<ffffffff81af2640>] ? xen_pagetable_init+0x1ba/0x3cb
[    0.000000]  [<ffffffff81af678b>] setup_arch+0xbcd/0xccf
[    0.000000]  [<ffffffff8159ecbe>] ? printk+0x4d/0x4f
[    0.000000]  [<ffffffff81aedcfd>] start_kernel+0x8b/0x416
[    0.000000]  [<ffffffff81aed5f0>] x86_64_start_reservations+0x2a/0x2c
[    0.000000]  [<ffffffff81af0fc7>] xen_start_kernel+0x582/0x584
[    0.000000] Code: f9 48 89 f8 48 c1 e9 12 48 c1 e8 09 48 89 fe 25 ff 01 00 0 
[    0.000000] RIP  [<ffffffff8100b7d4>] get_phys_to_machine+0x64/0x70
[    0.000000]  RSP <ffffffff81a03d70>
[    0.000000] CR2: 0000000000000000
[    0.000000] ---[ end trace 7aee8d2e027fb7f0 ]---
[    0.000000] Kernel panic - not syncing: Attempted to kill the idle task!




* Re: [PATCH V3] xen: eliminate scalability issues from initial mapping setup
  2014-09-24 13:20     ` David Vrabel
@ 2014-09-24 14:03     ` Juergen Gross
  -1 siblings, 0 replies; 9+ messages in thread
From: Juergen Gross @ 2014-09-24 14:03 UTC (permalink / raw
  To: David Vrabel, linux-kernel, xen-devel, konrad.wilk,
	boris.ostrovsky, jbeulich

On 09/24/2014 03:20 PM, David Vrabel wrote:
> On 17/09/14 15:59, Juergen Gross wrote:
>> Direct Xen to place the initial P->M table outside of the initial
>> mapping, as otherwise the 1G (implementation) / 2G (theoretical)
>> restriction on the size of the initial mapping limits the amount
>> of memory a domain can be handed initially.
>>
>> As the initial P->M table is copied rather early during boot to
>> domain private memory and its initial virtual mapping is dropped,
>> the easiest way to avoid virtual address conflicts with other
>> addresses in the kernel is to use a user address area for the
>> virtual address of the initial P->M table. This allows us to just
>> throw away the page tables of the initial mapping after the copy
>> without having to care about address invalidation.
>>
>> It should be noted that this patch won't enable a pv-domain to USE
>> more than 512 GB of RAM. It just enables it to be started with a
>> P->M table covering more memory. This is especially important for
>> being able to boot a Dom0 on a system with more than 512 GB memory.
>
> This doesn't seem to work.  It crashes when attempting to construct
> the page tables.  Have these patches been tested on a host with > 512 GiB?

Not yet. I did a code review and was pretty sure the memory above 512 GB
would be ignored; it seems I was wrong.

I'll have access to a machine with 1 TB of RAM soon, so I'll try to test
a patch which really does what I thought should be done: ignoring the
memory above 512 GB.

Thanks for testing!


Juergen


* Re: [PATCH V3] xen: eliminate scalability issues from initial mapping setup
  2014-09-24 13:20     ` David Vrabel
@ 2014-09-26  7:54     ` Juergen Gross
  -1 siblings, 0 replies; 9+ messages in thread
From: Juergen Gross @ 2014-09-26  7:54 UTC (permalink / raw
  To: David Vrabel, linux-kernel, xen-devel, konrad.wilk,
	boris.ostrovsky, jbeulich

On 09/24/2014 03:20 PM, David Vrabel wrote:
> On 17/09/14 15:59, Juergen Gross wrote:
>> Direct Xen to place the initial P->M table outside of the initial
>> mapping, as otherwise the 1G (implementation) / 2G (theoretical)
>> restriction on the size of the initial mapping limits the amount
>> of memory a domain can be handed initially.
>>
>> As the initial P->M table is copied rather early during boot to
>> domain private memory and its initial virtual mapping is dropped,
>> the easiest way to avoid virtual address conflicts with other
>> addresses in the kernel is to use a user address area for the
>> virtual address of the initial P->M table. This allows us to just
>> throw away the page tables of the initial mapping after the copy
>> without having to care about address invalidation.
>>
>> It should be noted that this patch won't enable a pv-domain to USE
>> more than 512 GB of RAM. It just enables it to be started with a
>> P->M table covering more memory. This is especially important for
>> being able to boot a Dom0 on a system with more than 512 GB memory.
>
> This doesn't seem to work.  It crashes when attempting to construct
> the page tables.  Have these patches been tested on a host with > 512 GiB?

After some tests on a machine with 1 TB of memory I think this patch
makes sense only if the booted domain will be capable of using all the
memory. Ignoring everything above 512 GB will not work, as the
hypervisor places vital information in this region (mfn_list, initrd,
...).

So please ignore this patch. I'm trying to set up a solution using a
virtually mapped linear p2m list.


Juergen



end of thread, other threads: [~2014-09-26  7:54 UTC | newest]

Thread overview: 9+ messages
2014-09-17 14:59 [PATCH V3] xen: remove some memory limits from pv-domains Juergen Gross
2014-09-17 14:59 ` [PATCH V3] xen: eliminate scalability issues from initial mapping setup Juergen Gross
2014-09-23  3:58   ` Juergen Gross
2014-09-23 13:10     ` David Vrabel
2014-09-23 13:10       ` David Vrabel
2014-09-24 13:20   ` David Vrabel
2014-09-24 13:20     ` David Vrabel
2014-09-24 14:03     ` Juergen Gross
2014-09-26  7:54     ` Juergen Gross
