Re: [PATCH] drm/amdgpu: Register bad page handler for Aldebaran

From: Borislav Petkov <bp@alien8.de>
To: "Joshi, Mukul" <Mukul.Joshi@amd.com>
Cc: "amd-gfx@lists.freedesktop.org" <amd-gfx@lists.freedesktop.org>,
	"Kasiviswanathan, Harish" <Harish.Kasiviswanathan@amd.com>,
	x86-ml <x86@kernel.org>, lkml <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH] drm/amdgpu: Register bad page handler for Aldebaran
Date: Wed, 12 May 2021 23:05:38 +0200	[thread overview]
Message-ID: <YJxDIhGnZ5XdukiS@zn.tnic> (raw)
In-Reply-To: <DM4PR12MB5263797EB7B2D37C21427A88EE529@DM4PR12MB5263.namprd12.prod.outlook.com>

On Wed, May 12, 2021 at 07:00:58PM +0000, Joshi, Mukul wrote:
> SMCA UMCv2 corresponds to GPU's UMC MCA bank and the GPU driver is
> only interested in errors on GPU UMC.

So that thing should be called SMCA_GPU_UMC not SMCA_UMC_V2.

> We cannot know this without is_smca_umc_v2.

You don't need it - just export smca_get_bank_type() and test the bank
type at the call site.

> Maybe. I hope its not too much of a concern if it stays the way it is.

That was just a suggestion anyway - it is not code I maintain so not my
call.

> I wasn't really sure if I should use the EDAC priority here or create a new one for Accelerator devices.
> I thought using EDAC priority might not be accepted by the maintainers as EDAC and GPU (Accelerator) devices
> are two different class of devices.
> That is the reason I create a new one. 
> I am OK to use EDAC priority if that is acceptable.

I don't know what's acceptable because I still am unclear as to what
that thing is supposed to do. It seems you are interested only in
uncorrectable errors.

How are those errors reported? #MC exception, deferred interrupt, simply
logged in the bank and we find them by polling?

Then, the commit message is talking about some "bad page retirement".
What does that do? What can the user do when she sees the

"Uncorrectable error detected in UMC..."

message?

It depends on what "retiring" of GPU pages means...

In any case, dmesg should issue a human-understandable message about the
recovery action being done and what that means for the user: should she
replace the GPU, should she ignore, etc, etc.

> A system can have multiple GPUs and we only want a single notifier
> registered. I will change the comment to explicitly state this.

Actually, the notifier registration should be able to return a different
retval to state that a callback has already been registered but that
warns only currently so I'm guessing we're stuck with such ugly
"workarounds" for their shortcomings.

I'm gonna take a look whether they can be fixed though so that you don't
have to do this notifier_registered thing.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette