From: "Joshi, Mukul" <Mukul.Joshi@amd.com>
To: "Ghannam, Yazen" <Yazen.Ghannam@amd.com>
Cc: "linux-edac@vger.kernel.org" <linux-edac@vger.kernel.org>,
	"x86@kernel.org" <x86@kernel.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"bp@alien8.de" <bp@alien8.de>,
	"mingo@redhat.com" <mingo@redhat.com>,
	"mchehab@kernel.org" <mchehab@kernel.org>,
	"amd-gfx@lists.freedesktop.org" <amd-gfx@lists.freedesktop.org>
Subject: RE: [PATCHv3 2/2] drm/amdgpu: Register MCE notifier for Aldebaran RAS
Date: Thu, 23 Sep 2021 15:30:55 +0000
Message-ID: <DM4PR12MB52639349DF98DB01A3B155EFEEA39@DM4PR12MB5263.namprd12.prod.outlook.com>
In-Reply-To: <YUyPM7VfYFG/PJmu@yaz-ubuntu>

> -----Original Message-----
> From: Ghannam, Yazen <Yazen.Ghannam@amd.com>
> Sent: Thursday, September 23, 2021 10:29 AM
> To: Joshi, Mukul <Mukul.Joshi@amd.com>
> Cc: linux-edac@vger.kernel.org; x86@kernel.org; linux-kernel@vger.kernel.org;
> bp@alien8.de; mingo@redhat.com; mchehab@kernel.org;
> amd-gfx@lists.freedesktop.org
> Subject: Re: [PATCHv3 2/2] drm/amdgpu: Register MCE notifier for Aldebaran RAS
> 
> On Wed, Sep 22, 2021 at 03:36:20PM -0400, Mukul Joshi wrote:
> > On Aldebaran, GPU driver will handle bad page retirement even though
> > UMC is host managed. As a result, register a bad page retirement
> > handler on the mce notifier chain to retire bad pages on Aldebaran.
> >
> 
> I think this should state that the driver will do page retirement for
> GPU-managed memory. As written, it implies that the driver does page
> retirement in general for the system.
> 
ACK. I will update the description.
> ...
> 
> > +
> > +static int amdgpu_bad_page_notifier(struct notifier_block *nb,
> > +				    unsigned long val, void *data) {
> > +	struct mce *m = (struct mce *)data;
> > +	struct amdgpu_device *adev = NULL;
> > +	uint32_t gpu_id = 0;
> > +	uint32_t umc_inst = 0;
> > +	uint32_t ch_inst, channel_index = 0;
> > +	struct ras_err_data err_data = {0, 0, 0, NULL};
> > +	struct eeprom_table_record err_rec;
> > +	uint64_t retired_page;
> > +
> > +	/*
> > +	 * If the error was generated in UMC_V2, which belongs to GPU UMCs,
> > +	 * and error occurred in DramECC (Extended error code = 0) then only
> > +	 * process the error, else bail out.
> > +	 */
> > +	if (!m || !((smca_get_bank_type(m->bank) == SMCA_UMC_V2) &&
> > +		    (XEC(m->status, 0x1f) == 0x0)))
> 
> The MCA_STATUS[ErrorCodeExt] field is bits [21:16], so the mask should be
> 0x3f.

Ack. Thanks for catching this.
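
To illustrate the mask issue, here is a minimal user-space sketch. It
assumes the kernel's XEC() helper has the shape shown (shift right by 16,
then mask); xec_6bit() and xec_5bit() are hypothetical names used only for
this comparison and are not part of the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed to mirror the kernel's XEC() helper: drop bits [15:0], then mask. */
#define XEC(x, mask) (((x) >> 16) & (mask))

/* Hypothetical helper: full 6-bit ErrorCodeExt field, bits [21:16]. */
static inline uint32_t xec_6bit(uint64_t status)
{
	return (uint32_t)XEC(status, 0x3f);
}

/* Hypothetical helper: the 5-bit mask used in v3, which drops bit 21. */
static inline uint32_t xec_5bit(uint64_t status)
{
	return (uint32_t)XEC(status, 0x1f);
}

/* Self-check: an extended error code of 0x20 only survives the 6-bit mask. */
static inline void xec_self_check(void)
{
	uint64_t status = (uint64_t)0x20 << 16;

	assert(xec_6bit(status) == 0x20);
	assert(xec_5bit(status) == 0x00);
}
```

With the 5-bit mask, a status whose extended error code is 0x20 reads back
as 0x0, so a check against 0x0 (DramECC) would match it by mistake.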
> 
> > +		return NOTIFY_DONE;
> > +
> > +	/*
> > +	 * If it is correctable error, return.
> > +	 */
> > +	if (mce_is_correctable(m))
> > +		return NOTIFY_OK;
> 
> Shouldn't this be "NOTIFY_DONE" if "don't care" about this error?

The thinking is that we want to stop calling further consumers, since it's
a correctable error in a GPU UMC and we take no action on correctable
errors.
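
For reference while weighing NOTIFY_OK against NOTIFY_DONE, here is a
simplified user-space model of a notifier chain walk. It assumes the
return-code values from include/linux/notifier.h; run_chain() and the
ret_*() callbacks are hypothetical stand-ins, not kernel code, and the real
chain walk does more than this.

```c
#include <assert.h>
#include <stddef.h>

/* Return codes, assumed to match include/linux/notifier.h. */
#define NOTIFY_DONE      0x0000	/* don't care */
#define NOTIFY_OK        0x0001	/* suits me */
#define NOTIFY_STOP_MASK 0x8000	/* don't call further */
#define NOTIFY_STOP      (NOTIFY_OK | NOTIFY_STOP_MASK)

/* Hypothetical callback type: each consumer bumps a call counter. */
typedef int (*notifier_fn)(int *calls);

static int ret_done(int *calls) { ++*calls; return NOTIFY_DONE; }
static int ret_ok(int *calls)   { ++*calls; return NOTIFY_OK; }
static int ret_stop(int *calls) { ++*calls; return NOTIFY_STOP; }

/* Walk the chain in order; stop only when a consumer sets NOTIFY_STOP_MASK. */
static int run_chain(const notifier_fn *chain, size_t n, int *calls)
{
	int ret = NOTIFY_DONE;
	size_t i;

	assert(chain != NULL && calls != NULL);
	for (i = 0; i < n; i++) {
		ret = chain[i](calls);
		if (ret & NOTIFY_STOP_MASK)
			break;
	}
	return ret;
}
```

In this model, a chain of { ret_ok, ret_done } runs both consumers, while
{ ret_stop, ret_done } runs only the first. Under these assumptions, only a
return value carrying NOTIFY_STOP_MASK ends the walk.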

Thanks,
Mukul

> 
> Thanks,
> Yazen
