Re: Issues with trying to boot falcons from sgt memory + Possible firmware SG_DEBUG fix?

Nouveau Archive mirror

From: Ben Skeggs <bskeggs@nvidia.com>
To: David Airlie <airlied@redhat.com>, Lyude Paul <lyude@redhat.com>
Cc: Danilo Krummrich <dakr@redhat.com>, Timur Tabi <ttabi@nvidia.com>,
	<nouveau@lists.freedesktop.org>
Subject: Re: Issues with trying to boot falcons from sgt memory + Possible firmware SG_DEBUG fix?
Date: Fri, 19 Apr 2024 23:54:25 +1000
Message-ID: <8273a133-b2b2-40c2-af27-f57c8fb5cbf0@nvidia.com>
In-Reply-To: <CAMwc25qzmHJ4C=qxRB0GEwcZqGn2wpd7zFRvx0DphEbJvY9pEw@mail.gmail.com>

On 19/4/24 08:14, David Airlie wrote:

> On Fri, Apr 19, 2024 at 6:27 AM Lyude Paul <lyude@redhat.com> wrote:
>> So - first some context here for Ben and anyone else who hasn't been
>> following. A little while ago I got a Slimbook Executive 16 with a
>> Nvidia RTX 4060 in it, and I've unfortunately been running into a kind
>> of annoying issue. Currently this laptop only has 16 gigs of RAM, and
>> as it turns out - this can easily lead the system to having pretty
>> heavy memory fragmentation once it starts swapping pages out.
>>
>> Normally this wouldn't matter, but I unfortunately discovered that when
>> we're runtime suspending the GPU in Nouveau - we actually appear to
>> allocate some of the memory we use for migrating using
>> dma_alloc_coherent. This starts to fail on my system once memory
>> fragmentation goes up like so:
>>
>>    kworker/18:0: page allocation failure: order:7, mode:0xcc0(GFP_KERNEL),
>>    nodemask=(null),cpuset=/,mems_allowed=0
>>    CPU: 18 PID: 287012 Comm: kworker/18:0 Not tainted
>>    6.8.4-200.ChopperV1.fc39.x86_64 #1
>>    Hardware name: SLIMBOOK Executive/Executive, BIOS N.1.10GRU06 02/02/2024
>>    Workqueue: pm pm_runtime_work
>>    Call Trace:
>>     <TASK>
>>     dump_stack_lvl+0x47/0x60
>>     warn_alloc+0x165/0x1e0
>>     ? __alloc_pages_direct_compact+0x1ad/0x2b0
>>     __alloc_pages_slowpath.constprop.0+0xd7d/0xde0
>>     __alloc_pages+0x32d/0x350
>>     __dma_direct_alloc_pages.isra.0+0x16a/0x2b0
>>     dma_direct_alloc+0x70/0x280
>>     nvkm_gsp_radix3_sg+0x5e/0x130 [nouveau]
>>     r535_gsp_fini+0x1d4/0x350 [nouveau]
>>     nvkm_subdev_fini+0x67/0x150 [nouveau]
>>     nvkm_device_fini+0x95/0x1e0 [nouveau]
>>     nvkm_udevice_fini+0x53/0x70 [nouveau]
>>     nvkm_object_fini+0xb9/0x240 [nouveau]
>>     nvkm_object_fini+0x75/0x240 [nouveau]
>>     nouveau_do_suspend+0xf5/0x280 [nouveau]
>>     nouveau_pmops_runtime_suspend+0x3e/0xb0 [nouveau]
>>     pci_pm_runtime_suspend+0x67/0x1e0
>>     ? __pfx_pci_pm_runtime_suspend+0x10/0x10
>>     __rpm_callback+0x41/0x170
>>     ? __pfx_pci_pm_runtime_suspend+0x10/0x10
>>     rpm_callback+0x5d/0x70
>>     ? __pfx_pci_pm_runtime_suspend+0x10/0x10
>>     rpm_suspend+0x120/0x6a0
>>     pm_runtime_work+0x98/0xb0
>>     process_one_work+0x171/0x340
>>     worker_thread+0x27b/0x3a0
>>     ? __pfx_worker_thread+0x10/0x10
>>     kthread+0xe5/0x120
>>     ? __pfx_kthread+0x10/0x10
>>     ret_from_fork+0x31/0x50
>>     ? __pfx_kthread+0x10/0x10
>>     ret_from_fork_asm+0x1b/0x30
>>
>>    nouveau 0000:01:00.0: gsp: suspend failed, -12
>>    nouveau: DRM-master:00000000:00000080: suspend failed with -12
>>    nouveau 0000:01:00.0: can't suspend (nouveau_pmops_runtime_suspend
>>    [nouveau] returned -12)
>>
>> Keep in mind, I don't dive into memory management related stuff like
>> this very often! But I'd very much like to know how to help out
>> anywhere around the driver, including outside of my usual domains, so
>> I've been trying to write up a patch for this. The original suggestion
>> for a fix that Dave Airlie had given me was (unless I misunderstood,
>> which isn't unlikely) to try to see if we could get nvkm_gsp_mem_ctor()
>> to start allocating memory with vmalloc() and map that onto the GPU
>> using the SG helpers instead. So - I gave a shot at writing up a patch
>> for doing that:
>>
>> https://gitlab.freedesktop.org/lyudess/linux/-/commit/b5a41ac2bd948979815d262d8d20b4f3333f9c26
>>
>> As you can probably guess - the patch does not really seem to work, and
>> I've been trying to figure out why. There are already a couple of issues
>> I'm aware of: the most glaring one being that as Timur pointed out, a
>> lot of GSP hardware expects contiguous memory allocations - but
>> according to them the allocation that's specifically failing should be
>> small enough that it'd be allocated in a contiguous page anyway:
> nvkm_gsp_mem_ctor is used to do coherent allocations in a bunch of
> places in the gsp code; we can't use vmalloc for a lot of them. A lot
> of the allocations are small multi-page buffers that hang around, and
> the hardware expects them to be non-scattered.
>
> Now in this single case we have a large amount of data pointed to by a
> radix3 page table.
>
> The data is allocated with nvkm_gsp_sg, but we then fail to allocate the
> first level of the page table with the coherent allocator. However I
> don't think the first level of the page table needs to be allocated
> with the coherent allocator, we should allocate it with nvkm_gsp_sg
> instead.

Yes, that seems sensible here.  Lyude, did you want me to take a look at 
making this change, or are you working on it already?

Ben.

>
> Dave.
>
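
The failing request in the trace is order:7, i.e. 2^7 = 128 physically contiguous 4 KiB pages (512 KiB), made by dma_direct_alloc on behalf of nvkm_gsp_radix3_sg. The standalone C sketch below estimates each table size in a three-level radix layout so the scaling is visible. It assumes 4 KiB pages and one 8-byte entry per mapped page; RADIX_PAGE_SIZE, PTE_SIZE and the helper functions are hypothetical names for illustration, not nouveau's definitions.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed constants for this sketch: 4 KiB pages, 8-byte table entries. */
#define RADIX_PAGE_SIZE 4096ULL
#define PTE_SIZE        sizeof(uint64_t)

/* Round x up to a whole number of pages. */
static uint64_t page_align(uint64_t x)
{
	return (x + RADIX_PAGE_SIZE - 1) / RADIX_PAGE_SIZE * RADIX_PAGE_SIZE;
}

/* Bytes of table needed to hold one entry per page of 'mapped' bytes. */
static uint64_t table_bytes(uint64_t mapped)
{
	uint64_t pages = page_align(mapped) / RADIX_PAGE_SIZE;

	return page_align(pages * PTE_SIZE);
}

/* Smallest buddy order (a block of 2^order pages) that holds 'bytes'. */
static unsigned int order_for(uint64_t bytes)
{
	unsigned int order = 0;

	while ((RADIX_PAGE_SIZE << order) < bytes)
		order++;
	return order;
}

/*
 * Table sizes for a three-level layout mapping 'data_bytes':
 * sizes[2] maps the data pages, sizes[1] maps the pages of sizes[2],
 * and sizes[0] (the root) maps the pages of sizes[1].
 */
static void radix3_sizes(uint64_t data_bytes, uint64_t sizes[3])
{
	uint64_t mapped = data_bytes;

	for (int i = 2; i >= 0; i--) {
		sizes[i] = table_bytes(mapped);
		mapped = sizes[i];
	}
}
```

Under these assumptions, radix3_sizes() for a 256 MiB buffer yields {4096, 4096, 524288}, and order_for(524288) returns 7. The level that maps the data pages grows at roughly 1/512 of the buffer size, while the two levels above it stay at one page for buffers up to 1 GiB, which is consistent with that large level being the one that needs a high-order block when it comes from the coherent allocator.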


Thread overview: 4+ messages
2024-04-18 20:27 Issues with trying to boot falcons from sgt memory + Possible firmware SG_DEBUG fix? Lyude Paul
2024-04-18 22:14 ` David Airlie
2024-04-19 13:54   ` Ben Skeggs [this message]
2024-04-19 13:52 ` Ben Skeggs
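
Dave's suggestion is to stop allocating the largest page-table level, the one that maps the data pages, with the coherent allocator and use nvkm_gsp_sg for it instead. The userspace simulation below sketches why that can avoid the high-order allocation, assuming that whatever walks the table only needs each 4 KiB table page to be contiguous, with each higher level recording the address of every page of the level beneath it. Plain calloc() stands in for page allocation and pointers stand in for DMA addresses; struct level, radix3_build, radix3_lookup and radix3_free are hypothetical names, and this is not the nouveau implementation.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Assumed: 4 KiB table pages holding 8-byte entries (512 per page). */
#define RADIX_PAGE_SIZE 4096u
#define PTES_PER_PAGE   (RADIX_PAGE_SIZE / sizeof(uint64_t))

/* One table level, built from independently allocated pages. */
struct level {
	size_t npages;
	uint64_t **pages;
};

/* lvl[0] is the single root page; lvl[2] holds the data-page addresses. */
struct radix3 {
	struct level lvl[3];
};

/* Store addrs[0..n) in a level, one separately allocated page per 512 entries. */
static int level_fill(struct level *lvl, const uint64_t *addrs, size_t n)
{
	size_t np = (n + PTES_PER_PAGE - 1) / PTES_PER_PAGE;

	lvl->pages = calloc(np, sizeof(*lvl->pages));
	if (!lvl->pages)
		return -1;
	lvl->npages = np;
	for (size_t p = 0; p < np; p++) {
		/* Each table page is its own small allocation: no large block. */
		lvl->pages[p] = calloc(PTES_PER_PAGE, sizeof(uint64_t));
		if (!lvl->pages[p])
			return -1;
	}
	for (size_t k = 0; k < n; k++)
		lvl->pages[k / PTES_PER_PAGE][k % PTES_PER_PAGE] = addrs[k];
	return 0;
}

/* Build all three levels; on failure the caller should call radix3_free(). */
static int radix3_build(struct radix3 *rx, const uint64_t *data, size_t ndata)
{
	if (level_fill(&rx->lvl[2], data, ndata))
		return -1;
	for (int i = 1; i >= 0; i--) {
		const struct level *below = &rx->lvl[i + 1];
		uint64_t *addrs = malloc(below->npages * sizeof(*addrs));
		int ret;

		if (!addrs)
			return -1;
		/* The level above records the address of every page below it. */
		for (size_t p = 0; p < below->npages; p++)
			addrs[p] = (uint64_t)(uintptr_t)below->pages[p];
		ret = level_fill(&rx->lvl[i], addrs, below->npages);
		free(addrs);
		if (ret)
			return -1;
	}
	return rx->lvl[0].npages == 1 ? 0 : -1; /* root must fit in one page */
}

/* Walk root -> level 1 -> level 2; 'offset' must lie inside the mapped data. */
static uint64_t radix3_lookup(const struct radix3 *rx, uint64_t offset)
{
	uint64_t d = offset / RADIX_PAGE_SIZE; /* data page index */
	uint64_t i1 = d / PTES_PER_PAGE;       /* level-2 page index */
	uint64_t i0 = i1 / PTES_PER_PAGE;      /* level-1 page index */
	const uint64_t *root = rx->lvl[0].pages[0];
	const uint64_t *l1 = (const uint64_t *)(uintptr_t)root[i0];
	const uint64_t *l2 = (const uint64_t *)(uintptr_t)l1[i1 % PTES_PER_PAGE];

	return l2[d % PTES_PER_PAGE] + offset % RADIX_PAGE_SIZE;
}

/* Release every table page; safe on a zero-initialised or partly built table. */
static void radix3_free(struct radix3 *rx)
{
	for (int i = 0; i < 3; i++) {
		for (size_t p = 0; p < rx->lvl[i].npages; p++)
			free(rx->lvl[i].pages[p]);
		free(rx->lvl[i].pages);
		rx->lvl[i].pages = NULL;
		rx->lvl[i].npages = 0;
	}
}
```

In this model no request is larger than a single table page, and only the root has to fit in one page (512 level-1 entries), so the buffer size no longer drives the allocation order. The cost is that the level above the scattered one must list each of its pages individually, as radix3_build() does.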
