From: Yazen Ghannam <yazen.ghannam@amd.com>
To: "Joshi, Mukul" <Mukul.Joshi@amd.com>
Cc: x86-ml <x86@kernel.org>,
	"Kasiviswanathan, Harish" <Harish.Kasiviswanathan@amd.com>,
	lkml <linux-kernel@vger.kernel.org>,
	"amd-gfx@lists.freedesktop.org" <amd-gfx@lists.freedesktop.org>,
	Borislav Petkov <bp@alien8.de>,
	Alex Deucher <alexdeucher@gmail.com>
Subject: Re: [PATCH] drm/amdgpu: Register bad page handler for Aldebaran
Date: Thu, 3 Jun 2021 17:13:14 -0400
Message-ID: <20210603211255.GA1410@aus-x-yghannam.amd.com>
In-Reply-To: <DM4PR12MB5263BCFD05993959820430A5EE239@DM4PR12MB5263.namprd12.prod.outlook.com>

On Thu, May 27, 2021 at 03:54:27PM -0400, Joshi, Mukul wrote:
...
> > Is that the same deferred interrupt which calls
> > amd_deferred_error_interrupt() ?
> 
> Sorry picking this up after sometime. I thought I had replied to this email.
> Yes it is the same deferred interrupt which calls amd_deferred_error_interrupt().
>

Mukul,

Do you expect that the driver will need to mark pages with high
correctable error counts as bad? I think the hardware folks may want the
GPU memory errors to be handled more aggressively than CPU memory
errors. The specific threshold may change from product to product, so it
may make sense to hardcode this in the driver.

We have similar functionality in the Correctable Errors Collector. But
enterprise users may prefer a direct approach done in the driver (based
on the hardware experts' guidance) instead of configuring the kernel at
runtime.
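
To make the "hardcoded threshold in the driver" idea concrete, here is a
minimal userspace sketch of per-page correctable error counting. It is
not taken from amdgpu or from the Correctable Errors Collector. The
names (ce_tracker, ce_tracker_record, GPU_CE_THRESHOLD) and the
threshold value are all hypothetical, chosen only for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-product threshold; a real value would come from
 * hardware guidance and is not stated anywhere in this thread. */
#define GPU_CE_THRESHOLD  3
#define MAX_TRACKED_PAGES 64

struct ce_entry {
	uint64_t pfn;        /* page frame number that reported errors */
	unsigned int count;  /* correctable errors seen so far */
	bool retired;        /* set once the page has been marked bad */
};

struct ce_tracker {
	struct ce_entry entries[MAX_TRACKED_PAGES];
	size_t used;
};

/*
 * Record one correctable error for @pfn.
 * Returns true exactly once per page: on the error that makes the
 * count reach GPU_CE_THRESHOLD, i.e. when the caller should mark the
 * page as bad.
 */
static bool ce_tracker_record(struct ce_tracker *t, uint64_t pfn)
{
	struct ce_entry *e = NULL;

	assert(t != NULL);

	/* Look for an existing entry for this page. */
	for (size_t i = 0; i < t->used; i++) {
		if (t->entries[i].pfn == pfn) {
			e = &t->entries[i];
			break;
		}
	}

	/* First error on this page: start tracking it. */
	if (e == NULL) {
		if (t->used == MAX_TRACKED_PAGES)
			return false; /* table full; this sketch just drops it */
		e = &t->entries[t->used++];
		e->pfn = pfn;
		e->count = 0;
		e->retired = false;
	}

	if (e->retired)
		return false;

	if (++e->count >= GPU_CE_THRESHOLD) {
		e->retired = true;
		return true;
	}
	return false;
}
```

With a zero-initialized struct ce_tracker, the third call for the same
pfn returns true and later calls for that pfn return false. A fixed
table keeps the sketch short; a real driver would need locking and a
bounded eviction policy.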

So I think having a separate priority may make sense if some special
functionality, or combination of behaviors, is needed that doesn't fall
under any existing priority. In this case, "special functionality" could
be that the GPU memory needs to be handled differently than CPU memory.

Another thing is that this behavior is similar to the NFIT behavior,
i.e. there's a memory error on an external device that needs to be
handled by the device's driver. So maybe we can rename MCE_PRIO_NFIT to
be generic (MCE_PRIO_EXTERNAL?) and use that? Having multiple notifiers
with the same priority is okay, right?
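
On the same-priority question, the following userspace sketch shows how
a priority-sorted notifier chain can accept several entries with equal
priority. The insertion rule is modeled on my understanding of
notifier_chain_register() in kernel/notifier.c, so please check it
against the tree in question. The struct and function names here are
hypothetical, and MCE_PRIO_EXTERNAL is only the name proposed above, not
an existing constant.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for struct notifier_block (hypothetical). */
struct notifier {
	int priority;           /* higher priority runs earlier */
	const char *name;
	struct notifier *next;
};

/*
 * Insert @n so the chain stays sorted by descending priority.
 * The walk stops only at an entry with strictly lower priority, so a
 * new entry lands after any existing entries of equal priority. Equal
 * priorities therefore coexist and run in registration order.
 */
static void chain_register(struct notifier **head, struct notifier *n)
{
	assert(head != NULL && n != NULL);

	while (*head != NULL) {
		if (n->priority > (*head)->priority)
			break;
		head = &(*head)->next;
	}
	n->next = *head;
	*head = n;
}
```

For example, registering an NFIT handler and then a GPU handler, both
at one shared "external" priority, leaves the chain ordered NFIT first
and GPU second, with any lower-priority notifier after both.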

Thanks,
Yazen