From: Borislav Petkov <bp@alien8.de>
To: "Joshi, Mukul" <Mukul.Joshi@amd.com>
Cc: "amd-gfx@lists.freedesktop.org" <amd-gfx@lists.freedesktop.org>,
	"Kasiviswanathan, Harish" <Harish.Kasiviswanathan@amd.com>,
	x86-ml <x86@kernel.org>, lkml <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH] drm/amdgpu: Register bad page handler for Aldebaran
Date: Wed, 12 May 2021 23:05:38 +0200
Message-ID: <YJxDIhGnZ5XdukiS@zn.tnic>
In-Reply-To: <DM4PR12MB5263797EB7B2D37C21427A88EE529@DM4PR12MB5263.namprd12.prod.outlook.com>

On Wed, May 12, 2021 at 07:00:58PM +0000, Joshi, Mukul wrote:
> SMCA UMCv2 corresponds to GPU's UMC MCA bank and the GPU driver is
> only interested in errors on GPU UMC.

So that thing should be called SMCA_GPU_UMC not SMCA_UMC_V2.

> We cannot know this without is_smca_umc_v2.

You don't need it - just export smca_get_bank_type() and test the bank
type at the call site.

> Maybe. I hope its not too much of a concern if it stays the way it is.

That was just a suggestion anyway - it is not code I maintain so not my
call.

> I wasn't really sure if I should use the EDAC priority here or create a new one for Accelerator devices.
> I thought using EDAC priority might not be accepted by the maintainers as EDAC and GPU (Accelerator) devices
> are two different class of devices.
> That is the reason I create a new one.
> I am OK to use EDAC priority if that is acceptable.

I don't know what's acceptable because I still am unclear as to what
that thing is supposed to do.

It seems you are interested only in uncorrectable errors. How are those
errors reported? #MC exception, deferred interrupt, simply logged in the
bank and we find them by polling?

Then, the commit message is talking about some "bad page retirement".
What does that do?

What can the user do when she sees the "Uncorrectable error detected in
UMC..." message? It depends on what "retiring" of GPU pages means...

In any case, dmesg should issue a human-understandable message about the
recovery action being done and what that means for the user: should she
replace the GPU, should she ignore, etc, etc.

> A system can have multiple GPUs and we only want a single notifier
> registered. I will change the comment to explicitly state this.

Actually, the notifier registration should be able to return a different
retval to state that a callback has already been registered but that
warns only currently so I'm guessing we're stuck with such ugly
"workarounds" for their shortcomings.

I'm gonna take a look whether they can be fixed though so that you don't
have to do this notifier_registered thing.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette
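
[Illustration, not part of the original message] The two suggestions above (test the bank type at the call site instead of adding an is_smca_umc_v2 helper, and guard against registering the notifier once per GPU) can be sketched in userspace C. This is only a rough sketch under stated assumptions: every name prefixed with mock_ or MOCK_ is hypothetical, mock_get_bank_type() merely stands in for an exported smca_get_bank_type() whose real signature and return values may differ, and the bank-to-type mapping is invented for the example.

```c
#include <stdbool.h>

/* Hypothetical stand-ins; the real kernel enum and struct mce differ. */
enum mock_bank_type {
	MOCK_BANK_CPU_UMC,
	MOCK_BANK_GPU_UMC,	/* the type the thread suggests calling SMCA_GPU_UMC */
	MOCK_BANK_OTHER,
};

struct mock_mce {
	unsigned int bank;	/* MCA bank that logged the error */
	bool uncorrectable;	/* true for an uncorrectable error */
};

enum { MOCK_NOTIFY_DONE = 0, MOCK_NOTIFY_OK = 1 };

/* Assumption: stands in for an exported smca_get_bank_type(); mapping is invented. */
static enum mock_bank_type mock_get_bank_type(unsigned int bank)
{
	switch (bank) {
	case 0:
		return MOCK_BANK_CPU_UMC;
	case 1:
		return MOCK_BANK_GPU_UMC;
	default:
		return MOCK_BANK_OTHER;
	}
}

/* Notifier callback: the bank type is tested here, at the call site. */
static int mock_gpu_notifier(const struct mock_mce *m)
{
	if (mock_get_bank_type(m->bank) != MOCK_BANK_GPU_UMC)
		return MOCK_NOTIFY_DONE;	/* not a GPU UMC bank, ignore */

	if (!m->uncorrectable)
		return MOCK_NOTIFY_DONE;	/* only uncorrectable errors matter */

	/* A real handler would start its recovery action and log it here. */
	return MOCK_NOTIFY_OK;
}

static bool notifier_registered;	/* shared across all GPU instances */
static int mock_chain_len;		/* counts entries on the mock chain */

/* Called once per GPU; only the first call adds the callback. */
static void mock_register_notifier_once(void)
{
	if (notifier_registered)
		return;

	mock_chain_len++;
	notifier_registered = true;
}
```

With this shape, a second GPU calling mock_register_notifier_once() leaves the chain unchanged, which is the behaviour the notifier_registered flag in the patch is meant to provide.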