All the mail mirrored from lore.kernel.org
 help / color / mirror / Atom feed
From: Borislav Petkov <bp@alien8.de>
To: "Joshi, Mukul" <Mukul.Joshi@amd.com>
Cc: "amd-gfx@lists.freedesktop.org" <amd-gfx@lists.freedesktop.org>,
	"Kasiviswanathan, Harish" <Harish.Kasiviswanathan@amd.com>,
	x86-ml <x86@kernel.org>, lkml <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH] drm/amdgpu: Register bad page handler for Aldebaran
Date: Wed, 12 May 2021 23:05:38 +0200	[thread overview]
Message-ID: <YJxDIhGnZ5XdukiS@zn.tnic> (raw)
In-Reply-To: <DM4PR12MB5263797EB7B2D37C21427A88EE529@DM4PR12MB5263.namprd12.prod.outlook.com>

On Wed, May 12, 2021 at 07:00:58PM +0000, Joshi, Mukul wrote:
> SMCA UMCv2 corresponds to GPU's UMC MCA bank and the GPU driver is
> only interested in errors on GPU UMC.

So that thing should be called SMCA_GPU_UMC not SMCA_UMC_V2.

> We cannot know this without is_smca_umc_v2.

You don't need it - just export smca_get_bank_type() and test the bank
type at the call site.

> Maybe. I hope its not too much of a concern if it stays the way it is.

That was just a suggestion anyway - it is not code I maintain so not my
call.

> I wasn't really sure if I should use the EDAC priority here or create a new one for Accelerator devices.
> I thought using EDAC priority might not be accepted by the maintainers as EDAC and GPU (Accelerator) devices
> are two different class of devices.
> That is the reason I create a new one. 
> I am OK to use EDAC priority if that is acceptable.

I don't know what's acceptable because I still am unclear as to what
that thing is supposed to do. It seems you are interested only in
uncorrectable errors.

How are those errors reported? #MC exception, deferred interrupt, simply
logged in the bank and we find them by polling?

Then, the commit message is talking about some "bad page retirement".
What does that do? What can the user do when she sees the

"Uncorrectable error detected in UMC..."

message?

It depends on what "retiring" of GPU pages means...

In any case, dmesg should issue a human-understandable message about the
recovery action being done and what that means for the user: should she
replace the GPU, should she ignore, etc, etc.

> A system can have multiple GPUs and we only want a single notifier
> registered. I will change the comment to explicitly state this.

Actually, the notifier registration should be able to return a different
retval to state that a callback has already been registered but that
warns only currently so I'm guessing we're stuck with such ugly
"workarounds" for their shortcomings.

I'm gonna take a look whether they can be fixed though so that you don't
have to do this notifier_registered thing.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

WARNING: multiple messages have this Message-ID (diff)
From: Borislav Petkov <bp@alien8.de>
To: "Joshi, Mukul" <Mukul.Joshi@amd.com>
Cc: x86-ml <x86@kernel.org>,
	"Kasiviswanathan, Harish" <Harish.Kasiviswanathan@amd.com>,
	lkml <linux-kernel@vger.kernel.org>,
	"amd-gfx@lists.freedesktop.org" <amd-gfx@lists.freedesktop.org>
Subject: Re: [PATCH] drm/amdgpu: Register bad page handler for Aldebaran
Date: Wed, 12 May 2021 23:05:38 +0200	[thread overview]
Message-ID: <YJxDIhGnZ5XdukiS@zn.tnic> (raw)
In-Reply-To: <DM4PR12MB5263797EB7B2D37C21427A88EE529@DM4PR12MB5263.namprd12.prod.outlook.com>

On Wed, May 12, 2021 at 07:00:58PM +0000, Joshi, Mukul wrote:
> SMCA UMCv2 corresponds to GPU's UMC MCA bank and the GPU driver is
> only interested in errors on GPU UMC.

So that thing should be called SMCA_GPU_UMC not SMCA_UMC_V2.

> We cannot know this without is_smca_umc_v2.

You don't need it - just export smca_get_bank_type() and test the bank
type at the call site.

> Maybe. I hope its not too much of a concern if it stays the way it is.

That was just a suggestion anyway - it is not code I maintain so not my
call.

> I wasn't really sure if I should use the EDAC priority here or create a new one for Accelerator devices.
> I thought using EDAC priority might not be accepted by the maintainers as EDAC and GPU (Accelerator) devices
> are two different class of devices.
> That is the reason I create a new one. 
> I am OK to use EDAC priority if that is acceptable.

I don't know what's acceptable because I still am unclear as to what
that thing is supposed to do. It seems you are interested only in
uncorrectable errors.

How are those errors reported? #MC exception, deferred interrupt, simply
logged in the bank and we find them by polling?

Then, the commit message is talking about some "bad page retirement".
What does that do? What can the user do when she sees the

"Uncorrectable error detected in UMC..."

message?

It depends on what "retiring" of GPU pages means...

In any case, dmesg should issue a human-understandable message about the
recovery action being done and what that means for the user: should she
replace the GPU, should she ignore, etc, etc.

> A system can have multiple GPUs and we only want a single notifier
> registered. I will change the comment to explicitly state this.

Actually, the notifier registration should be able to return a different
retval to state that a callback has already been registered but that
warns only currently so I'm guessing we're stuck with such ugly
"workarounds" for their shortcomings.

I'm gonna take a look whether they can be fixed though so that you don't
have to do this notifier_registered thing.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

  reply	other threads:[~2021-05-12 22:19 UTC|newest]

Thread overview: 40+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-05-12  1:30 [PATCH] drm/amdgpu: Register bad page handler for Aldebaran Mukul Joshi
2021-05-12  9:36 ` Borislav Petkov
2021-05-12  9:36   ` Borislav Petkov
2021-05-12 19:00   ` Joshi, Mukul
2021-05-12 19:00     ` Joshi, Mukul
2021-05-12 21:05     ` Borislav Petkov [this message]
2021-05-12 21:05       ` Borislav Petkov
2021-05-13  3:20       ` Joshi, Mukul
2021-05-13  3:20         ` Joshi, Mukul
2021-05-13  9:53         ` Borislav Petkov
2021-05-13  9:53           ` Borislav Petkov
2021-05-13 14:17           ` Alex Deucher
2021-05-13 14:17             ` Alex Deucher
2021-05-13 14:30             ` Borislav Petkov
2021-05-13 14:30               ` Borislav Petkov
2021-05-13 14:32               ` Alex Deucher
2021-05-13 14:32                 ` Alex Deucher
2021-05-13 14:57                 ` Borislav Petkov
2021-05-13 14:57                   ` Borislav Petkov
2021-05-13 15:02                   ` Alex Deucher
2021-05-13 15:02                     ` Alex Deucher
2021-05-13 23:14                   ` Joshi, Mukul
2021-05-13 23:14                     ` Joshi, Mukul
2021-05-14  7:03                     ` Borislav Petkov
2021-05-14  7:03                       ` Borislav Petkov
2021-05-27 19:54                       ` Joshi, Mukul
2021-05-27 19:54                         ` Joshi, Mukul
2021-06-03 21:13                         ` Yazen Ghannam
2021-06-03 21:13                           ` Yazen Ghannam
2021-07-29 23:59                           ` Joshi, Mukul
2021-07-29 23:59                             ` Joshi, Mukul
2021-09-13  1:31                             ` Joshi, Mukul
2021-05-13 23:10           ` Joshi, Mukul
2021-05-13 23:10             ` Joshi, Mukul
2021-05-14  7:05             ` Borislav Petkov
2021-05-14  7:05               ` Borislav Petkov
2021-05-14 13:06               ` Joshi, Mukul
2021-05-14 13:06                 ` Joshi, Mukul
2021-05-14 14:38                 ` Borislav Petkov
2021-05-14 14:38                   ` Borislav Petkov

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=YJxDIhGnZ5XdukiS@zn.tnic \
    --to=bp@alien8.de \
    --cc=Harish.Kasiviswanathan@amd.com \
    --cc=Mukul.Joshi@amd.com \
    --cc=amd-gfx@lists.freedesktop.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=x86@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.