From: Dan Williams <dan.j.williams@intel.com>
To: Gregory Price <gregory.price@memverge.com>, <linux-cxl@vger.kernel.org>
Cc: Dan Williams <dan.j.williams@intel.com>,
	Dave Jiang <dave.jiang@intel.com>
Subject: Re: [BUG] DAX access of Memory Expander on RCH topology fires BUG on page_table_check
Date: Mon, 17 Apr 2023 23:43:27 -0700
Message-ID: <643e3c0f22afd_556e2941c@dwillia2-mobl3.amr.corp.intel.com.notmuch>
In-Reply-To: <ZDfp/+7uTyh2wWcX@memverge.com>

Gregory Price wrote:
> On Wed, Apr 12, 2023 at 02:43:33PM -0400, Gregory Price wrote:
> > 
> > 
> > I was looking to validate mlock-ability of various pages when CXL is in
> > different states (numa, dax, etc), and I discovered a page_table_check
> > BUG when accessing MemExp memory while a device is in daxdev mode.
> > 
> > this happens essentially on a fault of the first accessed page
> > 
> > int dax_fd = open(device_path, O_RDWR);
> > void *mapped_memory = mmap(NULL, (1024*1024*2), PROT_READ | PROT_WRITE, MAP_SHARED, dax_fd, 0);
> > ((char*)mapped_memory)[0] = 1;
> > 
> > 
> > Full details of my test here:
> > 
> > Step 1) Test that memory onlined in NUMA node works
> > 
> > [user@host0 ~]# numactl --hardware
> > available: 2 nodes (0-1)
> > node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63
> > node 0 size: 63892 MB
> > node 0 free: 59622 MB
> > node 1 cpus:
> > node 1 size: 129024 MB
> > node 1 free: 129024 MB
> > node distances:
> > node   0   1
> >   0:  10  50
> >   1:  255  10
> > 
> > 
> > [user@host0 ~]# numactl --preferred=1 memhog 128G
> > ... snip ...
> > 
> > Passes no problem, all memory is accessible and used.
> > 
> > 
> > 
> > Next, reconfigure the device to daxdev mode
> > 
> > 
> > [user@host0 ~]# daxctl list
> > [
> >   {
> >     "chardev":"dax0.0",
> >     "size":137438953472,
> >     "target_node":1,
> >     "align":2097152,
> >     "mode":"system-ram",
> >     "online_memblocks":63,
> >     "total_memblocks":63,
> >     "movable":true
> >   }
> > ]
> 
> 
> Follow up - i was investigating why my dax region here only created 63
> 2GB MemBlocks for a 128GB region, and the reason is a forced alignment
> of dax devices against the CXL Fixed Memory Window.
> 
> [    0.000000] BIOS-e820: [mem 0x0000001050000000-0x000000304fffffff] soft reserved
> [    0.000000] BIOS-e820: [mem 0x00003ffc00000000-0x00003ffc03ffffff] reserved
> [    0.000000] reserve setup_data: [mem 0x0000001050000000-0x000000304fffffff] soft reserved
> [    0.000000] reserve setup_data: [mem 0x00003ffc00000000-0x00003ffc03ffffff] reserved
> 
> 
> some debug prints i added
> 
> [   20.726483] dax cxl probe
> [   20.727330] cxl_dax_region dax_region0: alloc_dax_region: start 1050000000 end 304fffffff
> [   20.728405] Creating dev_dev
> [   20.729033] dev_dax nr_range: 0
> [   20.735481]  dax0.0: alloc range[0]: 0x0000001050000000:0x000000304fffffff
> 
> The memory backing this dax region gets squashed by this code:
> 
> +++ b/drivers/dax/kmem.c
> static int dax_kmem_range(struct dev_dax *dev_dax, int i, struct range *r)
>         struct dev_dax_range *dax_range = &dev_dax->ranges[i];
>         struct range *range = &dax_range->range;
> 
>         /* memory-block align the hotplug range */
>         r->start = ALIGN(range->start, memory_block_size_bytes());
>         r->end = ALIGN_DOWN(range->end + 1, memory_block_size_bytes()) - 1;
>         if (r->start >= r->end) {
>                 r->start = range->start;
>                 r->end = range->end;
> 
> 
> and we end up with a mapping range of:
> 
> start=0x1080000000
> end=0x2fffffffff
> 
> 
> Why NUMA-mode works under these conditions without crashing the system
> is escaping me at the moment,

Why would it crash? That range is valid within
0x1050000000-0x304fffffff.

>  given that the page faulting system goes
> through the same driver.  But my guess is that pfn-to-page mappings are
> off in some way when placed in devdax mode, whereas they're correct
> under numa mode.

pfn-to-page is pretty simple; it's the pfn-to-page_ext lookup that's
concerning for CONFIG_PAGE_TABLE_CHECK.

> Note that the above code chops off the first 768MB of the dax region and
> the last 1.25GB of the dax region.

Yes, if the core-mm picks 2GB for the block size (which it does for
systems with more than 64GB of memory), then it will align hot-added
ranges to that block size.
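
For reference, the trimming can be reproduced in userspace. This is only
a sketch of the arithmetic in the quoted dax_kmem_range() excerpt:
align_up(), align_down() and block_align() are stand-in names rather
than kernel symbols, and the 2GB block size is an assumption taken from
the 63 x 2GB memblocks reported above.

```c
#include <stdint.h>

/* Userspace stand-ins for the kernel's ALIGN()/ALIGN_DOWN() macros.
 * 'a' must be a power of two. */
static uint64_t align_up(uint64_t x, uint64_t a)
{
	return (x + a - 1) & ~(a - 1);
}

static uint64_t align_down(uint64_t x, uint64_t a)
{
	return x & ~(a - 1);
}

/* Inclusive [start, end], like the range printed by the debug messages. */
struct range {
	uint64_t start;
	uint64_t end;
};

/* Same arithmetic as the quoted excerpt: round the start up and the
 * (exclusive) end down to the memory block size. */
static struct range block_align(struct range in, uint64_t block)
{
	struct range r;

	r.start = align_up(in.start, block);
	r.end = align_down(in.end + 1, block) - 1;
	if (r.start >= r.end)
		r = in;	/* the quoted excerpt restores the raw range here */
	return r;
}
```

Feeding in 0x1050000000-0x304fffffff with a 2GB block size gives
0x1080000000-0x2fffffffff, i.e. 768MB trimmed from the front, 1.25GB
from the back, and 63 whole 2GB blocks left over.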

> The CFMW is required to be 256MB aligned, but this code will force
> anything mapped into that area to be 2GB aligned.  I don't think it's
> safe to say the BIOS is wrong.

The *minimum* alignment of the CFMWS window is 256M, but if they don't
want to waste memory on Linux they had better make it 2GB aligned.

BIOS looks ok here.

> It seems like the dax region ranges are being tied to memory block size,
> but that a raw devdax does not necessarily utilize memory blocks.  Is
> there a potential bug in the mode-switching code?

No memory-blocks to worry about in dax-mode. Until evidence to the
contrary, I'm still looking for how CONFIG_PAGE_TABLE_CHECK might get
confused by DAX mode switches.

Thread overview: 6+ messages
2023-04-12 18:43 [BUG] DAX access of Memory Expander on RCH topology fires BUG on page_table_check Gregory Price
2023-04-13 11:39 ` Gregory Price
2023-04-18  6:43   ` Dan Williams [this message]
2023-04-20  0:58     ` Gregory Price
2023-04-18  6:35 ` Dan Williams
2023-04-20  1:29   ` Gregory Price