Re: [PATCH 1/1] drm/i915: Reset request handling for gen9+

From: Tomas Elf <tomas.elf@intel.com>
To: Mika Kuoppala <mika.kuoppala@linux.intel.com>,
	intel-gfx@lists.freedesktop.org
Subject: Re: [PATCH 1/1] drm/i915: Reset request handling for gen9+
Date: Tue, 16 Jun 2015 20:57:47 +0100
Message-ID: <55807FBB.4090002@intel.com>
In-Reply-To: <1434461963-2438-1-git-send-email-mika.kuoppala@intel.com>

On 16/06/2015 14:39, Mika Kuoppala wrote:
> In order for skl+ hardware to guarantee that no context switch
> takes place during reset and that current context is properly
> saved, the driver needs to notify and query hw before commencing
> with reset.
>
> We will only proceed with reset if all engines report that they
> are ready for reset.
>
> As we skip the reset if any single engine reports not ready, this
> commit prevents system hang skl in some situations where the
> gpu/blitter is hanged and in such state that any write to generic
> reset register (GEN6_GDRST) causes immediate system hang.

If it solves an observed problem then that's great. What worries me
slightly is that we seem to be disabling full GPU reset permanently in
the case where one or more engines have decided, for whatever reason, to
never be ready for reset (who knows what the hardware could be up to?).
In that case we're permanently toast. Would it make sense to only
accommodate the engine up to a point: attempt the reset request a few
times and, if it fails x times in a row, simply ignore the outcome and
move ahead with the full GPU reset anyway? I mean, at that point, what
do we have to lose?
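
To make that concrete, here is a rough sketch of the bounded-retry idea
in plain C. It is not driver code: struct fake_gpu, fake_request_reset()
and MAX_RESET_REQUEST_ATTEMPTS are names I made up for illustration, and
the limit of 3 is just a placeholder for x.

```c
#include <stdbool.h>

/* Hypothetical cap on failed reset requests before we stop asking. */
#define MAX_RESET_REQUEST_ATTEMPTS 3

/* Stand-in for the hardware, not a real i915 structure. */
struct fake_gpu {
	int ready_after;     /* request number that succeeds; 0 = never */
	int requests_made;   /* how many reset requests were issued */
	bool was_reset;      /* set once the full GPU reset has been done */
	bool forced;         /* reset went ahead without a ready handshake */
};

/* Ask all engines to prepare for reset; true if every engine is ready. */
static bool fake_request_reset(struct fake_gpu *gpu)
{
	gpu->requests_made++;
	return gpu->ready_after && gpu->requests_made >= gpu->ready_after;
}

/* Try the handshake a few times, then reset regardless of the outcome. */
static void reset_gpu_with_bounded_requests(struct fake_gpu *gpu)
{
	int attempt;

	for (attempt = 0; attempt < MAX_RESET_REQUEST_ATTEMPTS; attempt++) {
		if (fake_request_reset(gpu)) {
			gpu->was_reset = true;
			return;
		}
	}

	/* Nothing left to lose: ignore the failed handshake and reset. */
	gpu->forced = true;
	gpu->was_reset = true;
}
```

The point of the sketch is only that the function always ends with a
reset; a GPU that never reports ready costs a few extra requests rather
than leaving us permanently without a full GPU reset.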

If we look beyond this patch for a moment and consider the effects of 
combining this patch with my per-engine reset support RFC series, what 
would happen is the following:

0) Hang detected

1) Engine reset request.

2a) If reset request ok, engine reset -> DONE.

2b) If reset request not ok -> clear reset request bit and FAIL engine 
reset. Go to full GPU reset promotion in step 3).

3) Promote to full GPU reset

4) (In this case there's currently no reset request in the RFC since
I've never heard anyone say that a reset request was necessary when doing
full GPU reset, only in the engine reset case - we're nuking everything
anyway. We could do what you're doing here and do a reset request for
all engines.)

5a) If all reset requests are ok, do full GPU reset -> DONE.

5b) If some reset requests are not ok -> Go back to 4) and retry a 
couple of times until we give up and simply reset the GPU as a last resort.

What's interesting here is that we would always request reset, both for
per-engine reset (in which case there _is_ a fall-back path if the reset
request fails: promote to full GPU reset) and for full GPU reset (in
which case we could back off and retry the reset request a couple of
times and then just ignore the reset request outcome if we wanted to).
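
Steps 0) to 5b) above can be sketched roughly as follows. This is a
stand-alone model, not code from the RFC or from this patch: struct
fake_hw, the refusal counters, NUM_ENGINES and FULL_RESET_RETRIES are
all assumptions made up for the example.

```c
#include <stdbool.h>

#define NUM_ENGINES 4            /* hypothetical engine count */
#define FULL_RESET_RETRIES 3     /* hypothetical retry budget for 5b */

enum reset_outcome {
	ENGINE_RESET_DONE,       /* step 2a */
	FULL_RESET_DONE,         /* step 5a */
	FULL_RESET_FORCED,       /* step 5b, retries exhausted */
};

/* Stand-in for the hardware: requests each engine refuses before it
 * reports ready; a negative value means it never becomes ready. */
struct fake_hw {
	int refusals_left[NUM_ENGINES];
};

/* Request reset on one engine; true if it reports ready. */
static bool request_reset(struct fake_hw *hw, int engine)
{
	if (hw->refusals_left[engine] < 0)
		return false;
	if (hw->refusals_left[engine] > 0) {
		hw->refusals_left[engine]--;
		return false;
	}
	return true;
}

/* Step 4: request reset on every engine; all of them must be ready. */
static bool request_reset_all(struct fake_hw *hw)
{
	bool all_ready = true;
	int i;

	for (i = 0; i < NUM_ENGINES; i++)
		if (!request_reset(hw, i))
			all_ready = false;
	return all_ready;
}

/* Steps 0 to 5b for a hang detected on hung_engine. */
static enum reset_outcome handle_hang(struct fake_hw *hw, int hung_engine)
{
	int attempt;

	/* 1, 2a: engine reset if the hung engine accepts the request. */
	if (request_reset(hw, hung_engine))
		return ENGINE_RESET_DONE;

	/* 2b, 3: engine reset failed, promote to full GPU reset. */
	for (attempt = 0; attempt < FULL_RESET_RETRIES; attempt++) {
		/* 4, 5a: full reset once all engines are ready. */
		if (request_reset_all(hw))
			return FULL_RESET_DONE;
	}

	/* 5b: give up on the handshake and reset as a last resort. */
	return FULL_RESET_FORCED;
}
```

Every path through handle_hang() ends in some kind of reset, which is
the property I'm after: the reset request can delay the reset but never
veto it.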

>
> References: https://bugs.freedesktop.org/show_bug.cgi?id=89959
> References: https://bugs.freedesktop.org/show_bug.cgi?id=90854
> Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com>
> ---
>   drivers/gpu/drm/i915/i915_reg.h     |  3 +++
>   drivers/gpu/drm/i915/intel_uncore.c | 32 +++++++++++++++++++++++++++++++-
>   2 files changed, 34 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/i915/i915_reg.h b/drivers/gpu/drm/i915/i915_reg.h
> index 0b979ad..3684f92 100644
> --- a/drivers/gpu/drm/i915/i915_reg.h
> +++ b/drivers/gpu/drm/i915/i915_reg.h
> @@ -1461,6 +1461,9 @@ enum skl_disp_power_wells {
>   #define RING_MAX_IDLE(base)	((base)+0x54)
>   #define RING_HWS_PGA(base)	((base)+0x80)
>   #define RING_HWS_PGA_GEN6(base)	((base)+0x2080)
> +#define RING_RESET_CTL(base)	((base)+0xd0)
> +#define   RESET_CTL_REQUEST_RESET  (1 << 0)
> +#define   RESET_CTL_READY_TO_RESET (1 << 1)
>
>   #define HSW_GTT_CACHE_EN	0x4024
>   #define   GTT_CACHE_EN_ALL	0xF0007FFF
> diff --git a/drivers/gpu/drm/i915/intel_uncore.c b/drivers/gpu/drm/i915/intel_uncore.c
> index 4a86cf0..404bce2 100644
> --- a/drivers/gpu/drm/i915/intel_uncore.c
> +++ b/drivers/gpu/drm/i915/intel_uncore.c
> @@ -1455,9 +1455,39 @@ static int gen6_do_reset(struct drm_device *dev)
>   	return ret;
>   }
>
> +static int wait_for_bits_set(struct drm_i915_private *dev_priv,
> +			     const u32 reg, const u32 mask, const int timeout)
> +{
> +	return wait_for((I915_READ(reg) & mask) == mask, timeout);
> +}
> +
> +static int gen9_do_reset(struct drm_device *dev)
> +{
> +	struct drm_i915_private *dev_priv = dev->dev_private;
> +	struct intel_engine_cs *engine;
> +	int ret, i;
> +
> +	for_each_ring(engine, dev_priv, i) {
> +		I915_WRITE(RING_RESET_CTL(engine->mmio_base),
> +			   _MASKED_BIT_ENABLE(RESET_CTL_REQUEST_RESET));
> +
> +		ret = wait_for_bits_set(dev_priv,
> +					RING_RESET_CTL(engine->mmio_base),
> +					RESET_CTL_READY_TO_RESET, 700);
> +		if (ret) {
> +			DRM_ERROR("%s: reset request timeout\n", engine->name);
> +			return -ENODEV;

You could clear the reset request bit at this point in order to back off 
from the reset request. I don't know what fall-back procedure would make 
most sense following that point but, hey, that's just one way of doing 
it. It would theoretically allow the command streamer to resume 
executing but then again, we're here because it's hung so I don't know 
if the engine is likely to resume doing anything following this point.
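
Roughly what I mean, as a stand-alone sketch rather than a change to
the patch. The two bit definitions are copied from the patch; struct
fake_engine, fake_write_reset_ctl() and the poll count are made up, and
I'm assuming the register follows the usual masked-write convention
(upper 16 bits select which of the lower 16 bits get updated), as the
_MASKED_BIT_ENABLE() in the patch implies. That the ready bit drops
when the request is withdrawn is an assumption of the model, not
something I have checked on hardware.

```c
#include <stdint.h>

/* Bit layout copied from the patch; everything else is a stand-in. */
#define RESET_CTL_REQUEST_RESET  (1u << 0)
#define RESET_CTL_READY_TO_RESET (1u << 1)

/* Masked-write helpers: the upper 16 bits select the bits to update. */
#define MASKED_BIT_ENABLE(b)  (((uint32_t)(b) << 16) | (b))
#define MASKED_BIT_DISABLE(b) ((uint32_t)(b) << 16)

/* Hypothetical model of one engine's reset control register. */
struct fake_engine {
	uint32_t reset_ctl;   /* current register value */
	int will_ack;         /* 1 if the "hardware" ever reports ready */
};

/* Apply a masked write to the modelled register. */
static void fake_write_reset_ctl(struct fake_engine *e, uint32_t val)
{
	uint32_t mask = val >> 16;

	e->reset_ctl = (e->reset_ctl & ~mask) | (val & mask);

	/* Model assumption: ready follows a pending, acknowledged request. */
	if ((e->reset_ctl & RESET_CTL_REQUEST_RESET) && e->will_ack)
		e->reset_ctl |= RESET_CTL_READY_TO_RESET;
	else
		e->reset_ctl &= ~RESET_CTL_READY_TO_RESET;
}

/* Returns 0 if the engine is ready for reset, -1 after backing off. */
static int request_engine_reset(struct fake_engine *e, int max_polls)
{
	int i;

	fake_write_reset_ctl(e, MASKED_BIT_ENABLE(RESET_CTL_REQUEST_RESET));

	for (i = 0; i < max_polls; i++) {
		if (e->reset_ctl & RESET_CTL_READY_TO_RESET)
			return 0;
	}

	/* Timed out: withdraw the request instead of leaving it pending. */
	fake_write_reset_ctl(e, MASKED_BIT_DISABLE(RESET_CTL_REQUEST_RESET));
	return -1;
}
```

In the real function the same masked-disable write would go just
before the return in the timeout branch.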

> +		}
> +	}
> +
> +	return gen6_do_reset(dev);
> +}
> +
>   static int (*intel_get_gpu_reset(struct drm_device *dev))(struct drm_device *)
>   {
> -	if (INTEL_INFO(dev)->gen >= 6)
> +	if (INTEL_INFO(dev)->gen >= 9)

This is actually applicable for gen8+ (it's part of my RFC from last 
week) and is the only way to idle an engine preceding a reset so you 
might as well generalise it to gen8 and onwards, not only gen9.
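
In other words, the selection would key off gen8 rather than gen9. A
toy version of the dispatch, with placeholder function names that are
not the driver's and with the pre-gen6 paths left out:

```c
#include <stddef.h>

/* Stand-in reset routines; the names are placeholders. */
static int legacy_reset(void) { return 6; }
static int reset_with_request(void) { return 8; }

typedef int (*reset_fn)(void);

/* Pick the reset path by hardware generation; NULL if not modelled. */
static reset_fn pick_reset(int gen)
{
	if (gen >= 8)
		return reset_with_request;   /* handshake first, then reset */
	if (gen >= 6)
		return legacy_reset;
	return NULL;                         /* older paths omitted here */
}
```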

Thanks,
Tomas

> +		return gen9_do_reset;
> +	else if (INTEL_INFO(dev)->gen >= 6)
>   		return gen6_do_reset;
>   	else if (IS_GEN5(dev))
>   		return ironlake_do_reset;
>
