All the mail mirrored from lore.kernel.org
 help / color / mirror / Atom feed
From: "Joshi, Mukul" <Mukul.Joshi@amd.com>
To: Borislav Petkov <bp@alien8.de>
Cc: x86-ml <x86@kernel.org>,
	"Kasiviswanathan, Harish" <Harish.Kasiviswanathan@amd.com>,
	lkml <linux-kernel@vger.kernel.org>,
	"amd-gfx@lists.freedesktop.org" <amd-gfx@lists.freedesktop.org>
Subject: RE: [PATCH] drm/amdgpu: Register bad page handler for Aldebaran
Date: Thu, 13 May 2021 03:20:36 +0000	[thread overview]
Message-ID: <DM4PR12MB52631035F875B77974FA8D21EE519@DM4PR12MB5263.namprd12.prod.outlook.com> (raw)
In-Reply-To: <YJxDIhGnZ5XdukiS@zn.tnic>

[AMD Official Use Only - Internal Distribution Only]



> -----Original Message-----
> From: Borislav Petkov <bp@alien8.de>
> Sent: Wednesday, May 12, 2021 5:06 PM
> To: Joshi, Mukul <Mukul.Joshi@amd.com>
> Cc: amd-gfx@lists.freedesktop.org; Kasiviswanathan, Harish
> <Harish.Kasiviswanathan@amd.com>; x86-ml <x86@kernel.org>; lkml <linux-
> kernel@vger.kernel.org>
> Subject: Re: [PATCH] drm/amdgpu: Register bad page handler for Aldebaran
> 
> [CAUTION: External Email]
> 
> On Wed, May 12, 2021 at 07:00:58PM +0000, Joshi, Mukul wrote:
> > SMCA UMCv2 corresponds to GPU's UMC MCA bank and the GPU driver is
> > only interested in errors on GPU UMC.
> 
> So that thing should be called SMCA_GPU_UMC not SMCA_UMC_V2.
> 
> > We cannot know this without is_smca_umc_v2.
> 
> You don't need it - just export smca_get_bank_type() and test the bank type at
> the call site.

Exporting smca_get_bank_type() works fine when CONFIG_X86_MCE_AMD is defined.
I would need to put #ifdef CONFIG_X86_MCE_AMD in my code to compile the amdgpu
driver when CONFIG_X86_MCE_AMD is not defined.
I can avoid all that by using is_smca_umc_v2().
I think it would be cleaner with using is_smca_umc_v2().

> 
> > Maybe. I hope its not too much of a concern if it stays the way it is.
> 
> That was just a suggestion anyway - it is not code I maintain so not my call.
> 
> > I wasn't really sure if I should use the EDAC priority here or create a new one
> for Accelerator devices.
> > I thought using EDAC priority might not be accepted by the maintainers
> > as EDAC and GPU (Accelerator) devices are two different class of devices.
> > That is the reason I create a new one.
> > I am OK to use EDAC priority if that is acceptable.
> 
> I don't know what's acceptable because I still am unclear as to what that thing is
> supposed to do. It seems you are interested only in uncorrectable errors.
> 

You can think of GPU device as a EDAC device here. It is mainly interested in
handling uncorrectable errors.

> How are those errors reported? #MC exception, deferred interrupt, simply
> logged in the bank and we find them by polling?
> 

It is a deferred interrupt that generates an MCE.

> Then, the commit message is talking about some "bad page retirement".
> What does that do? What can the user do when she sees the
> 
> "Uncorrectable error detected in UMC..."
> 
> message?
> 
> It depends on what "retiring" of GPU pages means...
> 
> In any case, dmesg should issue a human-understandable message about the
> recovery action being done and what that means for the user: should she
> replace the GPU, should she ignore, etc, etc.
>

When an uncorrectable error is detected on the GPU UMC, all we are doing is 
determining the physical address where the error occurred and then "retiring"
the page that address belongs to.
By retiring, we mean we reserve the page so that it is not available for allocations
to any applications.
We are providing information to the user by storing all the information about the
retired pages in EEPROM. This can be accessed through sysfs.

Hope it clears what "bad page retirement" is achieving.

Thanks,
Mukul

> > A system can have multiple GPUs and we only want a single notifier
> > registered. I will change the comment to explicitly state this.
> 
> Actually, the notifier registration should be able to return a different retval to
> state that a callback has already been registered but that warns only currently so
> I'm guessing we're stuck with such ugly "workarounds" for their shortcomings.
> 
> I'm gonna take a look whether they can be fixed though so that you don't have
> to do this notifier_registered thing.
> 
> --
> Regards/Gruss,
>     Boris.
> 
> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fpeople.
> kernel.org%2Ftglx%2Fnotes-about-
> netiquette&amp;data=04%7C01%7CMukul.Joshi%40amd.com%7C5e305845504
> 14bc53c3008d91589b50b%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0
> %7C637564503481206042%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAw
> MDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata
> =fF2iaSQxAREAV2OwLi%2F8cN9uRuqNNKim2FpWJUBQS98%3D&amp;reserved=
> 0
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

WARNING: multiple messages have this Message-ID (diff)
From: "Joshi, Mukul" <Mukul.Joshi@amd.com>
To: Borislav Petkov <bp@alien8.de>
Cc: "amd-gfx@lists.freedesktop.org" <amd-gfx@lists.freedesktop.org>,
	"Kasiviswanathan, Harish" <Harish.Kasiviswanathan@amd.com>,
	x86-ml <x86@kernel.org>, lkml <linux-kernel@vger.kernel.org>
Subject: RE: [PATCH] drm/amdgpu: Register bad page handler for Aldebaran
Date: Thu, 13 May 2021 03:20:36 +0000	[thread overview]
Message-ID: <DM4PR12MB52631035F875B77974FA8D21EE519@DM4PR12MB5263.namprd12.prod.outlook.com> (raw)
In-Reply-To: <YJxDIhGnZ5XdukiS@zn.tnic>

[AMD Official Use Only - Internal Distribution Only]



> -----Original Message-----
> From: Borislav Petkov <bp@alien8.de>
> Sent: Wednesday, May 12, 2021 5:06 PM
> To: Joshi, Mukul <Mukul.Joshi@amd.com>
> Cc: amd-gfx@lists.freedesktop.org; Kasiviswanathan, Harish
> <Harish.Kasiviswanathan@amd.com>; x86-ml <x86@kernel.org>; lkml <linux-
> kernel@vger.kernel.org>
> Subject: Re: [PATCH] drm/amdgpu: Register bad page handler for Aldebaran
> 
> [CAUTION: External Email]
> 
> On Wed, May 12, 2021 at 07:00:58PM +0000, Joshi, Mukul wrote:
> > SMCA UMCv2 corresponds to GPU's UMC MCA bank and the GPU driver is
> > only interested in errors on GPU UMC.
> 
> So that thing should be called SMCA_GPU_UMC not SMCA_UMC_V2.
> 
> > We cannot know this without is_smca_umc_v2.
> 
> You don't need it - just export smca_get_bank_type() and test the bank type at
> the call site.

Exporting smca_get_bank_type() works fine when CONFIG_X86_MCE_AMD is defined.
I would need to put #ifdef CONFIG_X86_MCE_AMD in my code to compile the amdgpu
driver when CONFIG_X86_MCE_AMD is not defined.
I can avoid all that by using is_smca_umc_v2().
I think it would be cleaner with using is_smca_umc_v2().

> 
> > Maybe. I hope its not too much of a concern if it stays the way it is.
> 
> That was just a suggestion anyway - it is not code I maintain so not my call.
> 
> > I wasn't really sure if I should use the EDAC priority here or create a new one
> for Accelerator devices.
> > I thought using EDAC priority might not be accepted by the maintainers
> > as EDAC and GPU (Accelerator) devices are two different class of devices.
> > That is the reason I create a new one.
> > I am OK to use EDAC priority if that is acceptable.
> 
> I don't know what's acceptable because I still am unclear as to what that thing is
> supposed to do. It seems you are interested only in uncorrectable errors.
> 

You can think of GPU device as a EDAC device here. It is mainly interested in
handling uncorrectable errors.

> How are those errors reported? #MC exception, deferred interrupt, simply
> logged in the bank and we find them by polling?
> 

It is a deferred interrupt that generates an MCE.

> Then, the commit message is talking about some "bad page retirement".
> What does that do? What can the user do when she sees the
> 
> "Uncorrectable error detected in UMC..."
> 
> message?
> 
> It depends on what "retiring" of GPU pages means...
> 
> In any case, dmesg should issue a human-understandable message about the
> recovery action being done and what that means for the user: should she
> replace the GPU, should she ignore, etc, etc.
>

When an uncorrectable error is detected on the GPU UMC, all we are doing is 
determining the physical address where the error occurred and then "retiring"
the page that address belongs to.
By retiring, we mean we reserve the page so that it is not available for allocations
to any applications.
We are providing information to the user by storing all the information about the
retired pages in EEPROM. This can be accessed through sysfs.

Hope it clears what "bad page retirement" is achieving.

Thanks,
Mukul

> > A system can have multiple GPUs and we only want a single notifier
> > registered. I will change the comment to explicitly state this.
> 
> Actually, the notifier registration should be able to return a different retval to
> state that a callback has already been registered but that warns only currently so
> I'm guessing we're stuck with such ugly "workarounds" for their shortcomings.
> 
> I'm gonna take a look whether they can be fixed though so that you don't have
> to do this notifier_registered thing.
> 
> --
> Regards/Gruss,
>     Boris.
> 
> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fpeople.
> kernel.org%2Ftglx%2Fnotes-about-
> netiquette&amp;data=04%7C01%7CMukul.Joshi%40amd.com%7C5e305845504
> 14bc53c3008d91589b50b%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0
> %7C637564503481206042%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAw
> MDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata
> =fF2iaSQxAREAV2OwLi%2F8cN9uRuqNNKim2FpWJUBQS98%3D&amp;reserved=
> 0

  reply	other threads:[~2021-05-13  3:20 UTC|newest]

Thread overview: 40+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-05-12  1:30 [PATCH] drm/amdgpu: Register bad page handler for Aldebaran Mukul Joshi
2021-05-12  9:36 ` Borislav Petkov
2021-05-12  9:36   ` Borislav Petkov
2021-05-12 19:00   ` Joshi, Mukul
2021-05-12 19:00     ` Joshi, Mukul
2021-05-12 21:05     ` Borislav Petkov
2021-05-12 21:05       ` Borislav Petkov
2021-05-13  3:20       ` Joshi, Mukul [this message]
2021-05-13  3:20         ` Joshi, Mukul
2021-05-13  9:53         ` Borislav Petkov
2021-05-13  9:53           ` Borislav Petkov
2021-05-13 14:17           ` Alex Deucher
2021-05-13 14:17             ` Alex Deucher
2021-05-13 14:30             ` Borislav Petkov
2021-05-13 14:30               ` Borislav Petkov
2021-05-13 14:32               ` Alex Deucher
2021-05-13 14:32                 ` Alex Deucher
2021-05-13 14:57                 ` Borislav Petkov
2021-05-13 14:57                   ` Borislav Petkov
2021-05-13 15:02                   ` Alex Deucher
2021-05-13 15:02                     ` Alex Deucher
2021-05-13 23:14                   ` Joshi, Mukul
2021-05-13 23:14                     ` Joshi, Mukul
2021-05-14  7:03                     ` Borislav Petkov
2021-05-14  7:03                       ` Borislav Petkov
2021-05-27 19:54                       ` Joshi, Mukul
2021-05-27 19:54                         ` Joshi, Mukul
2021-06-03 21:13                         ` Yazen Ghannam
2021-06-03 21:13                           ` Yazen Ghannam
2021-07-29 23:59                           ` Joshi, Mukul
2021-07-29 23:59                             ` Joshi, Mukul
2021-09-13  1:31                             ` Joshi, Mukul
2021-05-13 23:10           ` Joshi, Mukul
2021-05-13 23:10             ` Joshi, Mukul
2021-05-14  7:05             ` Borislav Petkov
2021-05-14  7:05               ` Borislav Petkov
2021-05-14 13:06               ` Joshi, Mukul
2021-05-14 13:06                 ` Joshi, Mukul
2021-05-14 14:38                 ` Borislav Petkov
2021-05-14 14:38                   ` Borislav Petkov

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=DM4PR12MB52631035F875B77974FA8D21EE519@DM4PR12MB5263.namprd12.prod.outlook.com \
    --to=mukul.joshi@amd.com \
    --cc=Harish.Kasiviswanathan@amd.com \
    --cc=amd-gfx@lists.freedesktop.org \
    --cc=bp@alien8.de \
    --cc=linux-kernel@vger.kernel.org \
    --cc=x86@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.