From: Borislav Petkov <bp@alien8.de>
To: "Joshi, Mukul" <Mukul.Joshi@amd.com>
Cc: "amd-gfx@lists.freedesktop.org" <amd-gfx@lists.freedesktop.org>,
	"Kasiviswanathan, Harish" <Harish.Kasiviswanathan@amd.com>,
	x86-ml <x86@kernel.org>, lkml <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH] drm/amdgpu: Register bad page handler for Aldebaran
Date: Thu, 13 May 2021 11:53:12 +0200
Message-ID: <YJz3CMBFFIDBzVwX@zn.tnic>
In-Reply-To: <DM4PR12MB52631035F875B77974FA8D21EE519@DM4PR12MB5263.namprd12.prod.outlook.com>

On Thu, May 13, 2021 at 03:20:36AM +0000, Joshi, Mukul wrote:
> Exporting smca_get_bank_type() works fine when CONFIG_X86_MCE_AMD is defined.
> I would need to put #ifdef CONFIG_X86_MCE_AMD in my code to compile the amdgpu
> driver when CONFIG_X86_MCE_AMD is not defined.
> I can avoid all that by using is_smca_umc_v2().
> I think it would be cleaner with using is_smca_umc_v2().

See how smca_get_long_name() is exported and export that function the
same way.

To save you some energy: is_smca_umc_v2() is not going to happen.

> You can think of GPU device as a EDAC device here. It is mainly
> interested in handling uncorrectable errors.

An EDAC "device", as you call it, is not interested in handling UEs. If
anything, it counts them.

> It is a deferred interrupt that generates an MCE.

Is that the same deferred interrupt which calls amd_deferred_error_interrupt()?

> When an uncorrectable error is detected on the GPU UMC, all we are
> doing is determining the physical address where the error occurred and
> then "retiring" the page that address belongs to.

What page is that? Normal DRAM page or a page in some special GPU memory?

> By retiring, we mean we reserve the page so that it is not available
> for allocations to any applications.

We do that for normal DRAM memory pages by poisoning them. I hope you
don't mean that.

Looking at

amdgpu_ras_add_bad_pages
|-> amdgpu_vram_mgr_reserve_range

that's some VRAM thing so I'm guessing special memory on the GPU.

If so, what happens with all those "retired" pages when you reboot?
They're getting used again and potentially trigger the same UEs and the
same retiring happens?

> We are providing information to the user by storing all the
> information about the retired pages in EEPROM. This can be accessed
> through sysfs.

Ok, I'm a user and I can access that information through sysfs. What can
I do with it?

> Hope it clears what "bad page retirement" is achieving.

It is getting there.

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette
