All the mail mirrored from lore.kernel.org
* [PATCH 1/1] drm/i915: Reset request handling for gen9+
@ 2015-06-16 13:39 Mika Kuoppala
  2015-06-16 14:09 ` Chris Wilson
                   ` (3 more replies)
  0 siblings, 4 replies; 21+ messages in thread
From: Mika Kuoppala @ 2015-06-16 13:39 UTC (permalink / raw)
  To: intel-gfx

In order for skl+ hardware to guarantee that no context switch
takes place during reset and that current context is properly
saved, the driver needs to notify and query hw before commencing
with reset.

We will only proceed with reset if all engines report that they
are ready for reset.

As we skip the reset if any single engine reports not ready, this
commit prevents a system hang on skl in some situations where the
gpu/blitter is hanged and in such a state that any write to the generic
reset register (GEN6_GDRST) causes an immediate system hang.

References: https://bugs.freedesktop.org/show_bug.cgi?id=89959
References: https://bugs.freedesktop.org/show_bug.cgi?id=90854
Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com>
---
 drivers/gpu/drm/i915/i915_reg.h     |  3 +++
 drivers/gpu/drm/i915/intel_uncore.c | 32 +++++++++++++++++++++++++++++++-
 2 files changed, 34 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/i915/i915_reg.h b/drivers/gpu/drm/i915/i915_reg.h
index 0b979ad..3684f92 100644
--- a/drivers/gpu/drm/i915/i915_reg.h
+++ b/drivers/gpu/drm/i915/i915_reg.h
@@ -1461,6 +1461,9 @@ enum skl_disp_power_wells {
 #define RING_MAX_IDLE(base)	((base)+0x54)
 #define RING_HWS_PGA(base)	((base)+0x80)
 #define RING_HWS_PGA_GEN6(base)	((base)+0x2080)
+#define RING_RESET_CTL(base)	((base)+0xd0)
+#define   RESET_CTL_REQUEST_RESET  (1 << 0)
+#define   RESET_CTL_READY_TO_RESET (1 << 1)
 
 #define HSW_GTT_CACHE_EN	0x4024
 #define   GTT_CACHE_EN_ALL	0xF0007FFF
diff --git a/drivers/gpu/drm/i915/intel_uncore.c b/drivers/gpu/drm/i915/intel_uncore.c
index 4a86cf0..404bce2 100644
--- a/drivers/gpu/drm/i915/intel_uncore.c
+++ b/drivers/gpu/drm/i915/intel_uncore.c
@@ -1455,9 +1455,39 @@ static int gen6_do_reset(struct drm_device *dev)
 	return ret;
 }
 
+static int wait_for_bits_set(struct drm_i915_private *dev_priv,
+			     const u32 reg, const u32 mask, const int timeout)
+{
+	return wait_for((I915_READ(reg) & mask) == mask, timeout);
+}
+
+static int gen9_do_reset(struct drm_device *dev)
+{
+	struct drm_i915_private *dev_priv = dev->dev_private;
+	struct intel_engine_cs *engine;
+	int ret, i;
+
+	for_each_ring(engine, dev_priv, i) {
+		I915_WRITE(RING_RESET_CTL(engine->mmio_base),
+			   _MASKED_BIT_ENABLE(RESET_CTL_REQUEST_RESET));
+
+		ret = wait_for_bits_set(dev_priv,
+					RING_RESET_CTL(engine->mmio_base),
+					RESET_CTL_READY_TO_RESET, 700);
+		if (ret) {
+			DRM_ERROR("%s: reset request timeout\n", engine->name);
+			return -ENODEV;
+		}
+	}
+
+	return gen6_do_reset(dev);
+}
+
 static int (*intel_get_gpu_reset(struct drm_device *dev))(struct drm_device *)
 {
-	if (INTEL_INFO(dev)->gen >= 6)
+	if (INTEL_INFO(dev)->gen >= 9)
+		return gen9_do_reset;
+	else if (INTEL_INFO(dev)->gen >= 6)
 		return gen6_do_reset;
 	else if (IS_GEN5(dev))
 		return ironlake_do_reset;
-- 
1.9.1

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx

* Re: [PATCH 1/1] drm/i915: Reset request handling for gen9+
  2015-06-16 13:39 [PATCH 1/1] drm/i915: Reset request handling for gen9+ Mika Kuoppala
@ 2015-06-16 14:09 ` Chris Wilson
  2015-06-16 17:10 ` Chris Wilson
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 21+ messages in thread
From: Chris Wilson @ 2015-06-16 14:09 UTC (permalink / raw)
  To: Mika Kuoppala; +Cc: intel-gfx

On Tue, Jun 16, 2015 at 04:39:23PM +0300, Mika Kuoppala wrote:
> In order for skl+ hardware to guarantee that no context switch
> takes place during reset and that current context is properly
> saved, the driver needs to notify and query hw before commencing
> with reset.
> 
> We will only proceed with reset if all engines report that they
> are ready for reset.
> 
> As we skip the reset if any single engine reports not ready, this
> commit prevents system hang skl in some situations where the
> gpu/blitter is hanged and in such state that any write to generic
> reset register (GEN6_GDRST) causes immediate system hang.
> 
> References: https://bugs.freedesktop.org/show_bug.cgi?id=89959
> References: https://bugs.freedesktop.org/show_bug.cgi?id=90854
> Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com>
> ---
>  drivers/gpu/drm/i915/i915_reg.h     |  3 +++
>  drivers/gpu/drm/i915/intel_uncore.c | 32 +++++++++++++++++++++++++++++++-
>  2 files changed, 34 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/i915/i915_reg.h b/drivers/gpu/drm/i915/i915_reg.h
> index 0b979ad..3684f92 100644
> --- a/drivers/gpu/drm/i915/i915_reg.h
> +++ b/drivers/gpu/drm/i915/i915_reg.h
> @@ -1461,6 +1461,9 @@ enum skl_disp_power_wells {
>  #define RING_MAX_IDLE(base)	((base)+0x54)
>  #define RING_HWS_PGA(base)	((base)+0x80)
>  #define RING_HWS_PGA_GEN6(base)	((base)+0x2080)
> +#define RING_RESET_CTL(base)	((base)+0xd0)
> +#define   RESET_CTL_REQUEST_RESET  (1 << 0)
> +#define   RESET_CTL_READY_TO_RESET (1 << 1)
>  
>  #define HSW_GTT_CACHE_EN	0x4024
>  #define   GTT_CACHE_EN_ALL	0xF0007FFF
> diff --git a/drivers/gpu/drm/i915/intel_uncore.c b/drivers/gpu/drm/i915/intel_uncore.c
> index 4a86cf0..404bce2 100644
> --- a/drivers/gpu/drm/i915/intel_uncore.c
> +++ b/drivers/gpu/drm/i915/intel_uncore.c
> @@ -1455,9 +1455,39 @@ static int gen6_do_reset(struct drm_device *dev)
>  	return ret;
>  }
>  
> +static int wait_for_bits_set(struct drm_i915_private *dev_priv,
> +			     const u32 reg, const u32 mask, const int timeout)

Use whitespace to group terms, and probably best to call it with both
mask and value for generality.

static int wait_for_register(struct drm_i915_private *dev_priv,
			     const u32 reg,
			     const u32 mask,
			     const u32 value,
			     const unsigned long timeout);

I hope this proves useful elsewhere, do you have a followup patch? It
should reduce the size of our module quite considerably.
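
The mask-plus-value form suggested above can be sketched in userspace as
follows. This is only an illustration under stated assumptions:
wait_for_register_sketch(), reg_read_fn, struct fake_dev and fake_read()
are hypothetical names, the function pointer stands in for I915_READ(),
and a poll budget replaces the driver's millisecond timeout so that the
behaviour is deterministic.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical register accessor; stands in for I915_READ() in this sketch. */
typedef uint32_t (*reg_read_fn)(void *ctx, uint32_t reg);

/*
 * Poll until (read(reg) & mask) == value, or until max_polls reads have
 * been made. Returns 0 on success and -1 when the poll budget runs out.
 */
static int wait_for_register_sketch(reg_read_fn read_reg, void *ctx,
				    uint32_t reg,
				    uint32_t mask,
				    uint32_t value,
				    unsigned long max_polls)
{
	unsigned long i;

	assert(read_reg != NULL);

	for (i = 0; i < max_polls; i++) {
		if ((read_reg(ctx, reg) & mask) == value)
			return 0;
	}

	return -1;
}

/* Fake device: reads return 0 until ready_after reads have happened, then 0x2. */
struct fake_dev {
	unsigned long reads;
	unsigned long ready_after;
};

static uint32_t fake_read(void *ctx, uint32_t reg)
{
	struct fake_dev *dev = ctx;

	(void)reg;
	return (++dev->reads > dev->ready_after) ? 0x2u : 0x0u;
}
```

Passing value == mask waits for the bits to become set, and value == 0
waits for them to clear, so one helper covers both kinds of wait.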
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre

* Re: [PATCH 1/1] drm/i915: Reset request handling for gen9+
  2015-06-16 13:39 [PATCH 1/1] drm/i915: Reset request handling for gen9+ Mika Kuoppala
  2015-06-16 14:09 ` Chris Wilson
@ 2015-06-16 17:10 ` Chris Wilson
  2015-06-16 20:15   ` Tomas Elf
  2015-06-16 19:57 ` Tomas Elf
  2015-06-17 12:35 ` [PATCH] drm/i915: Reset request handling for gen8+ Mika Kuoppala
  3 siblings, 1 reply; 21+ messages in thread
From: Chris Wilson @ 2015-06-16 17:10 UTC (permalink / raw)
  To: Mika Kuoppala; +Cc: intel-gfx

On Tue, Jun 16, 2015 at 04:39:23PM +0300, Mika Kuoppala wrote:
> In order for skl+ hardware to guarantee that no context switch
> takes place during reset and that current context is properly
> saved, the driver needs to notify and query hw before commencing
> with reset.
> 
> We will only proceed with reset if all engines report that they
> are ready for reset.
> 
> As we skip the reset if any single engine reports not ready, this
> commit prevents system hang skl in some situations where the
> gpu/blitter is hanged and in such state that any write to generic

s/is hanged/is wedged/ reads better

> reset register (GEN6_GDRST) causes immediate system hang.
> 
> References: https://bugs.freedesktop.org/show_bug.cgi?id=89959
> References: https://bugs.freedesktop.org/show_bug.cgi?id=90854
> Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com>
> ---
>  drivers/gpu/drm/i915/i915_reg.h     |  3 +++
>  drivers/gpu/drm/i915/intel_uncore.c | 32 +++++++++++++++++++++++++++++++-
>  2 files changed, 34 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/i915/i915_reg.h b/drivers/gpu/drm/i915/i915_reg.h
> index 0b979ad..3684f92 100644
> --- a/drivers/gpu/drm/i915/i915_reg.h
> +++ b/drivers/gpu/drm/i915/i915_reg.h
> @@ -1461,6 +1461,9 @@ enum skl_disp_power_wells {
>  #define RING_MAX_IDLE(base)	((base)+0x54)
>  #define RING_HWS_PGA(base)	((base)+0x80)
>  #define RING_HWS_PGA_GEN6(base)	((base)+0x2080)
> +#define RING_RESET_CTL(base)	((base)+0xd0)
> +#define   RESET_CTL_REQUEST_RESET  (1 << 0)
> +#define   RESET_CTL_READY_TO_RESET (1 << 1)
>  
>  #define HSW_GTT_CACHE_EN	0x4024
>  #define   GTT_CACHE_EN_ALL	0xF0007FFF
> diff --git a/drivers/gpu/drm/i915/intel_uncore.c b/drivers/gpu/drm/i915/intel_uncore.c
> index 4a86cf0..404bce2 100644
> --- a/drivers/gpu/drm/i915/intel_uncore.c
> +++ b/drivers/gpu/drm/i915/intel_uncore.c
> @@ -1455,9 +1455,39 @@ static int gen6_do_reset(struct drm_device *dev)
>  	return ret;
>  }
>  
> +static int wait_for_bits_set(struct drm_i915_private *dev_priv,
> +			     const u32 reg, const u32 mask, const int timeout)
> +{
> +	return wait_for((I915_READ(reg) & mask) == mask, timeout);
> +}
> +
> +static int gen9_do_reset(struct drm_device *dev)
> +{
> +	struct drm_i915_private *dev_priv = dev->dev_private;
> +	struct intel_engine_cs *engine;
> +	int ret, i;
> +
> +	for_each_ring(engine, dev_priv, i) {
> +		I915_WRITE(RING_RESET_CTL(engine->mmio_base),
> +			   _MASKED_BIT_ENABLE(RESET_CTL_REQUEST_RESET));
> +
> +		ret = wait_for_bits_set(dev_priv,
> +					RING_RESET_CTL(engine->mmio_base),
> +					RESET_CTL_READY_TO_RESET, 700);
> +		if (ret) {
> +			DRM_ERROR("%s: reset request timeout\n", engine->name);
> +			return -ENODEV;

return -EIO; since the reset didn't happen due to hardware issues
(ENODEV is that we don't have the implementation for the GPU rather than
it failed).

Do we need any recovery? Do you guarantee that the GPU reset resets the
CTL register?
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre

* Re: [PATCH 1/1] drm/i915: Reset request handling for gen9+
  2015-06-16 13:39 [PATCH 1/1] drm/i915: Reset request handling for gen9+ Mika Kuoppala
  2015-06-16 14:09 ` Chris Wilson
  2015-06-16 17:10 ` Chris Wilson
@ 2015-06-16 19:57 ` Tomas Elf
  2015-06-17 12:35 ` [PATCH] drm/i915: Reset request handling for gen8+ Mika Kuoppala
  3 siblings, 0 replies; 21+ messages in thread
From: Tomas Elf @ 2015-06-16 19:57 UTC (permalink / raw)
  To: Mika Kuoppala, intel-gfx

On 16/06/2015 14:39, Mika Kuoppala wrote:
> In order for skl+ hardware to guarantee that no context switch
> takes place during reset and that current context is properly
> saved, the driver needs to notify and query hw before commencing
> with reset.
>
> We will only proceed with reset if all engines report that they
> are ready for reset.
>
> As we skip the reset if any single engine reports not ready, this
> commit prevents system hang skl in some situations where the
> gpu/blitter is hanged and in such state that any write to generic
> reset register (GEN6_GDRST) causes immediate system hang.

If it solves an observed problem then that's great. What worries me 
slightly is that we seem to be disabling full GPU reset permanently in 
the case where one or more engines have decided for whatever reason to 
never be ready for reset (who knows what the hardware could be up to?). 
In that case we're permanently toast. Would it make sense to only 
accommodate the engine and attempt reset request a few times and if the 
reset request fails x times in a row we simply ignore the outcome and 
move ahead with the full GPU reset anyway? I mean, at that point, what 
do we have to lose?

If we look beyond this patch for a moment and consider the effects of 
combining this patch with my per-engine reset support RFC series, what 
would happen is the following:

0) Hang detected

1) Engine reset request.

2a) If reset request ok, engine reset -> DONE.

2b) If reset request not ok -> clear reset request bit and FAIL engine 
reset. Go to full GPU reset promotion in step 3).

3) Promote to full GPU reset

4) (In this case there's currently no reset request in the RFC since 
I've never heard anyone say that reset request was necessary when doing 
full GPU reset, only in the engine reset case - we're nuking everything 
anyway. We could do what you're doing here and do a reset request for 
all engines)

5a) If all reset requests are ok, do full GPU reset -> DONE.

5b) If some reset requests are not ok -> Go back to 4) and retry a 
couple of times until we give up and simply reset the GPU as a last resort.

What's interesting here is that we would always request reset for both 
per-engine reset (in which case there _is_ a fall-back path in case 
the reset request fails: promote to full GPU reset) and full GPU 
reset, in which case we could back off and retry the reset request a 
couple of times and then just ignore the reset request outcome if we 
wanted to.
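
The retry-then-force flow from steps 4) and 5) can be sketched as below.
It is a userspace illustration of the proposal only, not driver code:
struct reset_ops, reset_with_retries() and the fake_* helpers are
hypothetical names, and the callbacks stand in for the real register
accesses. Note that Mika reports elsewhere in this thread that forcing
the reset on current skl hardware can hang the system, so the last-resort
branch is exactly the contested part.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical hooks standing in for hardware accesses; not i915 API. */
struct reset_ops {
	bool (*request_reset)(void *hw, size_t engine);	/* true if engine is ready */
	void (*cancel_request)(void *hw, size_t engine);
	int (*full_gpu_reset)(void *hw);
};

/*
 * Ask every engine for permission up to max_attempts times. If all engines
 * agree, do a clean full reset. Otherwise withdraw the requests and retry.
 * After max_attempts failures, reset anyway and report it through *forced.
 */
static int reset_with_retries(const struct reset_ops *ops, void *hw,
			      size_t num_engines, unsigned int max_attempts,
			      bool *forced)
{
	unsigned int attempt;
	size_t i;

	assert(ops != NULL && forced != NULL);
	*forced = false;

	for (attempt = 0; attempt < max_attempts; attempt++) {
		bool all_ready = true;

		for (i = 0; i < num_engines; i++) {
			if (!ops->request_reset(hw, i)) {
				all_ready = false;
				break;
			}
		}

		if (all_ready)
			return ops->full_gpu_reset(hw);

		/* Back off: withdraw the request on every engine before retrying. */
		for (i = 0; i < num_engines; i++)
			ops->cancel_request(hw, i);
	}

	/* Last resort from the proposal: ignore the handshake outcome. */
	*forced = true;
	return ops->full_gpu_reset(hw);
}

/* Fake hardware: engine 1 refuses the first stuck_attempts requests. */
struct fake_hw {
	unsigned int stuck_attempts;
	unsigned int requests_seen;
	unsigned int resets;
	unsigned int cancels;
};

static bool fake_request(void *hw, size_t engine)
{
	struct fake_hw *f = hw;

	if (engine != 1)
		return true;
	return ++f->requests_seen > f->stuck_attempts;
}

static void fake_cancel(void *hw, size_t engine)
{
	struct fake_hw *f = hw;

	(void)engine;
	f->cancels++;
}

static int fake_full_reset(void *hw)
{
	struct fake_hw *f = hw;

	f->resets++;
	return 0;
}

static const struct reset_ops fake_ops = {
	fake_request, fake_cancel, fake_full_reset
};
```

Reporting the forced case separately lets a caller log or count the
resets that went ahead without a successful handshake.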

>
> References: https://bugs.freedesktop.org/show_bug.cgi?id=89959
> References: https://bugs.freedesktop.org/show_bug.cgi?id=90854
> Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com>
> ---
>   drivers/gpu/drm/i915/i915_reg.h     |  3 +++
>   drivers/gpu/drm/i915/intel_uncore.c | 32 +++++++++++++++++++++++++++++++-
>   2 files changed, 34 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/i915/i915_reg.h b/drivers/gpu/drm/i915/i915_reg.h
> index 0b979ad..3684f92 100644
> --- a/drivers/gpu/drm/i915/i915_reg.h
> +++ b/drivers/gpu/drm/i915/i915_reg.h
> @@ -1461,6 +1461,9 @@ enum skl_disp_power_wells {
>   #define RING_MAX_IDLE(base)	((base)+0x54)
>   #define RING_HWS_PGA(base)	((base)+0x80)
>   #define RING_HWS_PGA_GEN6(base)	((base)+0x2080)
> +#define RING_RESET_CTL(base)	((base)+0xd0)
> +#define   RESET_CTL_REQUEST_RESET  (1 << 0)
> +#define   RESET_CTL_READY_TO_RESET (1 << 1)
>
>   #define HSW_GTT_CACHE_EN	0x4024
>   #define   GTT_CACHE_EN_ALL	0xF0007FFF
> diff --git a/drivers/gpu/drm/i915/intel_uncore.c b/drivers/gpu/drm/i915/intel_uncore.c
> index 4a86cf0..404bce2 100644
> --- a/drivers/gpu/drm/i915/intel_uncore.c
> +++ b/drivers/gpu/drm/i915/intel_uncore.c
> @@ -1455,9 +1455,39 @@ static int gen6_do_reset(struct drm_device *dev)
>   	return ret;
>   }
>
> +static int wait_for_bits_set(struct drm_i915_private *dev_priv,
> +			     const u32 reg, const u32 mask, const int timeout)
> +{
> +	return wait_for((I915_READ(reg) & mask) == mask, timeout);
> +}
> +
> +static int gen9_do_reset(struct drm_device *dev)
> +{
> +	struct drm_i915_private *dev_priv = dev->dev_private;
> +	struct intel_engine_cs *engine;
> +	int ret, i;
> +
> +	for_each_ring(engine, dev_priv, i) {
> +		I915_WRITE(RING_RESET_CTL(engine->mmio_base),
> +			   _MASKED_BIT_ENABLE(RESET_CTL_REQUEST_RESET));
> +
> +		ret = wait_for_bits_set(dev_priv,
> +					RING_RESET_CTL(engine->mmio_base),
> +					RESET_CTL_READY_TO_RESET, 700);
> +		if (ret) {
> +			DRM_ERROR("%s: reset request timeout\n", engine->name);
> +			return -ENODEV;

You could clear the reset request bit at this point in order to back off 
from the reset request. I don't know what fall-back procedure would make 
most sense following that point but, hey, that's just one way of doing 
it. It would theoretically allow the command streamer to resume 
executing but then again, we're here because it's hung so I don't know 
if the engine is likely to resume doing anything following this point.

> +		}
> +	}
> +
> +	return gen6_do_reset(dev);
> +}
> +
>   static int (*intel_get_gpu_reset(struct drm_device *dev))(struct drm_device *)
>   {
> -	if (INTEL_INFO(dev)->gen >= 6)
> +	if (INTEL_INFO(dev)->gen >= 9)

This is actually applicable for gen8+ (it's part of my RFC from last 
week) and is the only way to idle an engine preceding a reset so you 
might as well generalise it to gen8 and onwards, not only gen9.

Thanks,
Tomas

> +		return gen9_do_reset;
> +	else if (INTEL_INFO(dev)->gen >= 6)
>   		return gen6_do_reset;
>   	else if (IS_GEN5(dev))
>   		return ironlake_do_reset;
>


* Re: [PATCH 1/1] drm/i915: Reset request handling for gen9+
  2015-06-16 17:10 ` Chris Wilson
@ 2015-06-16 20:15   ` Tomas Elf
  2015-06-17  6:33     ` Mika Kuoppala
  0 siblings, 1 reply; 21+ messages in thread
From: Tomas Elf @ 2015-06-16 20:15 UTC (permalink / raw)
  To: Chris Wilson, Mika Kuoppala, intel-gfx

On 16/06/2015 18:10, Chris Wilson wrote:
> On Tue, Jun 16, 2015 at 04:39:23PM +0300, Mika Kuoppala wrote:
>> In order for skl+ hardware to guarantee that no context switch
>> takes place during reset and that current context is properly
>> saved, the driver needs to notify and query hw before commencing
>> with reset.
>>
>> We will only proceed with reset if all engines report that they
>> are ready for reset.
>>
>> As we skip the reset if any single engine reports not ready, this
>> commit prevents system hang skl in some situations where the
>> gpu/blitter is hanged and in such state that any write to generic
>
> s/is hanged/is wedged/ reads better
>
>> reset register (GEN6_GDRST) causes immediate system hang.
>>
>> References: https://bugs.freedesktop.org/show_bug.cgi?id=89959
>> References: https://bugs.freedesktop.org/show_bug.cgi?id=90854
>> Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com>
>> ---
>>   drivers/gpu/drm/i915/i915_reg.h     |  3 +++
>>   drivers/gpu/drm/i915/intel_uncore.c | 32 +++++++++++++++++++++++++++++++-
>>   2 files changed, 34 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/gpu/drm/i915/i915_reg.h b/drivers/gpu/drm/i915/i915_reg.h
>> index 0b979ad..3684f92 100644
>> --- a/drivers/gpu/drm/i915/i915_reg.h
>> +++ b/drivers/gpu/drm/i915/i915_reg.h
>> @@ -1461,6 +1461,9 @@ enum skl_disp_power_wells {
>>   #define RING_MAX_IDLE(base)	((base)+0x54)
>>   #define RING_HWS_PGA(base)	((base)+0x80)
>>   #define RING_HWS_PGA_GEN6(base)	((base)+0x2080)
>> +#define RING_RESET_CTL(base)	((base)+0xd0)
>> +#define   RESET_CTL_REQUEST_RESET  (1 << 0)
>> +#define   RESET_CTL_READY_TO_RESET (1 << 1)
>>
>>   #define HSW_GTT_CACHE_EN	0x4024
>>   #define   GTT_CACHE_EN_ALL	0xF0007FFF
>> diff --git a/drivers/gpu/drm/i915/intel_uncore.c b/drivers/gpu/drm/i915/intel_uncore.c
>> index 4a86cf0..404bce2 100644
>> --- a/drivers/gpu/drm/i915/intel_uncore.c
>> +++ b/drivers/gpu/drm/i915/intel_uncore.c
>> @@ -1455,9 +1455,39 @@ static int gen6_do_reset(struct drm_device *dev)
>>   	return ret;
>>   }
>>
>> +static int wait_for_bits_set(struct drm_i915_private *dev_priv,
>> +			     const u32 reg, const u32 mask, const int timeout)
>> +{
>> +	return wait_for((I915_READ(reg) & mask) == mask, timeout);
>> +}
>> +
>> +static int gen9_do_reset(struct drm_device *dev)
>> +{
>> +	struct drm_i915_private *dev_priv = dev->dev_private;
>> +	struct intel_engine_cs *engine;
>> +	int ret, i;
>> +
>> +	for_each_ring(engine, dev_priv, i) {
>> +		I915_WRITE(RING_RESET_CTL(engine->mmio_base),
>> +			   _MASKED_BIT_ENABLE(RESET_CTL_REQUEST_RESET));
>> +
>> +		ret = wait_for_bits_set(dev_priv,
>> +					RING_RESET_CTL(engine->mmio_base),
>> +					RESET_CTL_READY_TO_RESET, 700);
>> +		if (ret) {
>> +			DRM_ERROR("%s: reset request timeout\n", engine->name);
>> +			return -ENODEV;
>
> return -EIO; since the reset didn't happen due to hardware issues
> (ENODEV is that we don't have the implementation for the GPU rather than
> it failed).
>
> Do we need any recovery? Do you guarantee that the GPU reset resets the
> CTL register?
> -Chris

According to the bspec (if I remember correctly from the last time I had 
to deal with it - Mika, correct me if I'm way off here):

If the reset request succeeds the reset request bit is cleared and 
ready_to_reset is set. Following the engine reset both ready_to_reset 
and reset request bits are set to 0. If the reset request fails the 
reset_request bit is obviously still set.

Then again, all of this is assuming engine resets rather than a full GPU 
reset. The bspec does not say anything about what the effect of a full 
gpu reset is on the reset control registers. It's always seemed to me 
like the reset control register is only relevant when doing a per-engine 
reset rather than a full GPU reset but I might very well be wrong about 
that, especially since you guys have seen problems when not involving 
this reset handshake before doing full GPU resets.
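
The bit transitions recalled above can be written down as a tiny state
model. This is a sketch of the behaviour as remembered in this mail for
the per-engine reset case, not something checked against the bspec;
struct reset_ctl and the three helpers are hypothetical names, and the
effect of a full GPU reset on these bits is deliberately left out because
it is unknown here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Model of the two reset control bits for one engine. */
struct reset_ctl {
	bool request_reset;
	bool ready_to_reset;
};

/* Driver side: ask the engine to prepare for reset. */
static void sw_request_reset(struct reset_ctl *r)
{
	assert(r != NULL);
	r->request_reset = true;
}

/*
 * Hardware side, as recalled: on success the request bit is cleared and
 * ready_to_reset is set. On failure the request bit simply stays set.
 */
static void hw_respond(struct reset_ctl *r, bool engine_can_idle)
{
	assert(r != NULL);
	if (r->request_reset && engine_can_idle) {
		r->request_reset = false;
		r->ready_to_reset = true;
	}
}

/* After the engine reset, both bits read back as 0. */
static void hw_engine_reset_done(struct reset_ctl *r)
{
	assert(r != NULL);
	r->request_reset = false;
	r->ready_to_reset = false;
}
```

In this model a request bit that stays set after the wait is the only
sign of a failed handshake, which is why backing off means clearing it.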

Thanks,
Tomas

>


* Re: [PATCH 1/1] drm/i915: Reset request handling for gen9+
  2015-06-16 20:15   ` Tomas Elf
@ 2015-06-17  6:33     ` Mika Kuoppala
  0 siblings, 0 replies; 21+ messages in thread
From: Mika Kuoppala @ 2015-06-17  6:33 UTC (permalink / raw)
  To: Tomas Elf, Chris Wilson, intel-gfx

Tomas Elf <tomas.elf@intel.com> writes:

> On 16/06/2015 18:10, Chris Wilson wrote:
>> On Tue, Jun 16, 2015 at 04:39:23PM +0300, Mika Kuoppala wrote:
>>> In order for skl+ hardware to guarantee that no context switch
>>> takes place during reset and that current context is properly
>>> saved, the driver needs to notify and query hw before commencing
>>> with reset.
>>>
>>> We will only proceed with reset if all engines report that they
>>> are ready for reset.
>>>
>>> As we skip the reset if any single engine reports not ready, this
>>> commit prevents system hang skl in some situations where the
>>> gpu/blitter is hanged and in such state that any write to generic
>>
>> s/is hanged/is wedged/ reads better
>>
>>> reset register (GEN6_GDRST) causes immediate system hang.
>>>
>>> References: https://bugs.freedesktop.org/show_bug.cgi?id=89959
>>> References: https://bugs.freedesktop.org/show_bug.cgi?id=90854
>>> Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com>
>>> ---
>>>   drivers/gpu/drm/i915/i915_reg.h     |  3 +++
>>>   drivers/gpu/drm/i915/intel_uncore.c | 32 +++++++++++++++++++++++++++++++-
>>>   2 files changed, 34 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/gpu/drm/i915/i915_reg.h b/drivers/gpu/drm/i915/i915_reg.h
>>> index 0b979ad..3684f92 100644
>>> --- a/drivers/gpu/drm/i915/i915_reg.h
>>> +++ b/drivers/gpu/drm/i915/i915_reg.h
>>> @@ -1461,6 +1461,9 @@ enum skl_disp_power_wells {
>>>   #define RING_MAX_IDLE(base)	((base)+0x54)
>>>   #define RING_HWS_PGA(base)	((base)+0x80)
>>>   #define RING_HWS_PGA_GEN6(base)	((base)+0x2080)
>>> +#define RING_RESET_CTL(base)	((base)+0xd0)
>>> +#define   RESET_CTL_REQUEST_RESET  (1 << 0)
>>> +#define   RESET_CTL_READY_TO_RESET (1 << 1)
>>>
>>>   #define HSW_GTT_CACHE_EN	0x4024
>>>   #define   GTT_CACHE_EN_ALL	0xF0007FFF
>>> diff --git a/drivers/gpu/drm/i915/intel_uncore.c b/drivers/gpu/drm/i915/intel_uncore.c
>>> index 4a86cf0..404bce2 100644
>>> --- a/drivers/gpu/drm/i915/intel_uncore.c
>>> +++ b/drivers/gpu/drm/i915/intel_uncore.c
>>> @@ -1455,9 +1455,39 @@ static int gen6_do_reset(struct drm_device *dev)
>>>   	return ret;
>>>   }
>>>
>>> +static int wait_for_bits_set(struct drm_i915_private *dev_priv,
>>> +			     const u32 reg, const u32 mask, const int timeout)
>>> +{
>>> +	return wait_for((I915_READ(reg) & mask) == mask, timeout);
>>> +}
>>> +
>>> +static int gen9_do_reset(struct drm_device *dev)
>>> +{
>>> +	struct drm_i915_private *dev_priv = dev->dev_private;
>>> +	struct intel_engine_cs *engine;
>>> +	int ret, i;
>>> +
>>> +	for_each_ring(engine, dev_priv, i) {
>>> +		I915_WRITE(RING_RESET_CTL(engine->mmio_base),
>>> +			   _MASKED_BIT_ENABLE(RESET_CTL_REQUEST_RESET));
>>> +
>>> +		ret = wait_for_bits_set(dev_priv,
>>> +					RING_RESET_CTL(engine->mmio_base),
>>> +					RESET_CTL_READY_TO_RESET, 700);
>>> +		if (ret) {
>>> +			DRM_ERROR("%s: reset request timeout\n", engine->name);
>>> +			return -ENODEV;
>>
>> return -EIO; since the reset didn't happen due to hardware issues
>> (ENODEV is that we don't have the implementation for the GPU rather than
>> it failed).
>>
>> Do we need any recovery? Do you guarantee that the GPU reset resets the
>> CTL register?
>> -Chris
>
> According to the bspec (if I remember correctly from the last time I had 
> to deal with it - Mika, correct me if I'm way off here):
>
> If the reset request succeeds the reset request bit is cleared and 
> ready_to_reset is set. Following the engine reset both ready_to_reset 
> and reset request bits are set to 0. If the reset request fails the 
> reset_request bit is obviously still set.
>
> Then again, all of this is assuming engine resets rather than a full GPU 
> reset. The bspec does not say anything about what the effect of a full 
> gpu reset is on the reset control registers. It's always seemed to me 
> like the reset control register is only relevant when doing a per-engine 
> reset rather than a full GPU reset but I might very well be wrong about 
> that, especially since you guys have seen problems when not involving 
> this reset handshake before doing full GPU resets.
>

I don't know if this is needed before doing a full gpu reset. But
as things are with current skl hardware, if the blitter ring
says it's not ready to reset, you had better not write to
the 0xc0 or you end up with a system hang.

So currently this is just a way to let some resets through
and avoid the fatal ones. gem_concurrent_blit seems to be an
excellent way of killing the gpu/blitter engine in such a way
that no normal reset recovery is possible.

-Mika

* [PATCH] drm/i915: Reset request handling for gen8+
  2015-06-16 13:39 [PATCH 1/1] drm/i915: Reset request handling for gen9+ Mika Kuoppala
                   ` (2 preceding siblings ...)
  2015-06-16 19:57 ` Tomas Elf
@ 2015-06-17 12:35 ` Mika Kuoppala
  2015-06-18  8:36   ` Mika Kuoppala
  3 siblings, 1 reply; 21+ messages in thread
From: Mika Kuoppala @ 2015-06-17 12:35 UTC (permalink / raw)
  To: intel-gfx; +Cc: Daniel Vetter

In order for skl+ hardware to guarantee that no context switch
takes place during engine reset and that current context is properly
saved, the driver needs to notify and query hw before commencing
with reset.

We will only proceed with reset if all engines report that they
are ready for reset.

As we skip the reset if any single engine reports not ready, this
commit prevents a system hang on skl in some situations where the
gpu/blitter is hanged and in such a state that any write to the generic
reset register (GEN6_GDRST) causes an immediate system hang.

v2: -EIO, Recovery, gen8 (Chris, Tomas, Daniel)

References: https://bugs.freedesktop.org/show_bug.cgi?id=89959
References: https://bugs.freedesktop.org/show_bug.cgi?id=90854
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Tomas Elf <tomas.elf@intel.com>
Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com>
---
 drivers/gpu/drm/i915/i915_reg.h     |  3 +++
 drivers/gpu/drm/i915/intel_uncore.c | 45 ++++++++++++++++++++++++++++++++++++-
 2 files changed, 47 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/i915/i915_reg.h b/drivers/gpu/drm/i915/i915_reg.h
index 0b979ad..3684f92 100644
--- a/drivers/gpu/drm/i915/i915_reg.h
+++ b/drivers/gpu/drm/i915/i915_reg.h
@@ -1461,6 +1461,9 @@ enum skl_disp_power_wells {
 #define RING_MAX_IDLE(base)	((base)+0x54)
 #define RING_HWS_PGA(base)	((base)+0x80)
 #define RING_HWS_PGA_GEN6(base)	((base)+0x2080)
+#define RING_RESET_CTL(base)	((base)+0xd0)
+#define   RESET_CTL_REQUEST_RESET  (1 << 0)
+#define   RESET_CTL_READY_TO_RESET (1 << 1)
 
 #define HSW_GTT_CACHE_EN	0x4024
 #define   GTT_CACHE_EN_ALL	0xF0007FFF
diff --git a/drivers/gpu/drm/i915/intel_uncore.c b/drivers/gpu/drm/i915/intel_uncore.c
index 4a86cf0..6a19b3e 100644
--- a/drivers/gpu/drm/i915/intel_uncore.c
+++ b/drivers/gpu/drm/i915/intel_uncore.c
@@ -1455,9 +1455,52 @@ static int gen6_do_reset(struct drm_device *dev)
 	return ret;
 }
 
+static int wait_for_register(struct drm_i915_private *dev_priv,
+			     const u32 reg,
+			     const u32 mask,
+			     const u32 value,
+			     const unsigned long timeout)
+{
+	return wait_for((I915_READ(reg) & mask) == value, timeout);
+}
+
+static int gen8_do_reset(struct drm_device *dev)
+{
+	struct drm_i915_private *dev_priv = dev->dev_private;
+	struct intel_engine_cs *engine;
+	int ret, i;
+
+	for_each_ring(engine, dev_priv, i) {
+		I915_WRITE(RING_RESET_CTL(engine->mmio_base),
+			   _MASKED_BIT_ENABLE(RESET_CTL_REQUEST_RESET));
+
+		ret = wait_for_register(dev_priv,
+					RING_RESET_CTL(engine->mmio_base),
+					RESET_CTL_READY_TO_RESET,
+					RESET_CTL_READY_TO_RESET,
+					700);
+		if (ret) {
+			DRM_ERROR("%s: reset request timeout\n", engine->name);
+			ret = -EIO;
+			goto not_ready;
+		}
+	}
+
+	return gen6_do_reset(dev);
+
+not_ready:
+	for_each_ring(engine, dev_priv, i)
+		I915_WRITE(RING_RESET_CTL(engine->mmio_base),
+			   _MASKED_BIT_DISABLE(RESET_CTL_REQUEST_RESET));
+
+	return ret;
+}
+
 static int (*intel_get_gpu_reset(struct drm_device *dev))(struct drm_device *)
 {
-	if (INTEL_INFO(dev)->gen >= 6)
+	if (INTEL_INFO(dev)->gen >= 8)
+		return gen8_do_reset;
+	else if (INTEL_INFO(dev)->gen >= 6)
 		return gen6_do_reset;
 	else if (IS_GEN5(dev))
 		return ironlake_do_reset;
-- 
1.9.1


* [PATCH] drm/i915: Reset request handling for gen8+
  2015-06-17 12:35 ` [PATCH] drm/i915: Reset request handling for gen8+ Mika Kuoppala
@ 2015-06-18  8:36   ` Mika Kuoppala
  2015-06-18  8:50     ` Chris Wilson
  0 siblings, 1 reply; 21+ messages in thread
From: Mika Kuoppala @ 2015-06-18  8:36 UTC (permalink / raw)
  To: intel-gfx; +Cc: Daniel Vetter

In order for gen8+ hardware to guarantee that no context switch
takes place during engine reset and that current context is properly
saved, the driver needs to notify and query hw before commencing
with reset.

There are gpu hangs where the engine gets so stuck that it will never
report that it is ready for reset. We could proceed with reset anyway, but
with some hangs on skl, the forced gpu reset will result in a system
hang. From inspection, the unreadiness for reset seems to correlate with
the probable system hang.

We will only proceed with reset if all engines report that they
are ready for reset. If the root cause of the system hang is found and
can be worked around by other means, we can reconsider whether to
reinstate full reset for the unreadiness case.

v2: -EIO, Recovery, gen8 (Chris, Tomas, Daniel)
v3: updated commit msg

References: https://bugs.freedesktop.org/show_bug.cgi?id=89959
References: https://bugs.freedesktop.org/show_bug.cgi?id=90854
Testcase: igt/gem_concurrent_blit --r prw-blt-overwrite-source-read-rcs-forked
Testcase: igt/gem_concurrent_blit --r gtt-blt-overwrite-source-read-rcs-forked
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Tomas Elf <tomas.elf@intel.com>
Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com>
---
 drivers/gpu/drm/i915/i915_reg.h     |  3 +++
 drivers/gpu/drm/i915/intel_uncore.c | 45 ++++++++++++++++++++++++++++++++++++-
 2 files changed, 47 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/i915/i915_reg.h b/drivers/gpu/drm/i915/i915_reg.h
index 0b979ad..3684f92 100644
--- a/drivers/gpu/drm/i915/i915_reg.h
+++ b/drivers/gpu/drm/i915/i915_reg.h
@@ -1461,6 +1461,9 @@ enum skl_disp_power_wells {
 #define RING_MAX_IDLE(base)	((base)+0x54)
 #define RING_HWS_PGA(base)	((base)+0x80)
 #define RING_HWS_PGA_GEN6(base)	((base)+0x2080)
+#define RING_RESET_CTL(base)	((base)+0xd0)
+#define   RESET_CTL_REQUEST_RESET  (1 << 0)
+#define   RESET_CTL_READY_TO_RESET (1 << 1)
 
 #define HSW_GTT_CACHE_EN	0x4024
 #define   GTT_CACHE_EN_ALL	0xF0007FFF
diff --git a/drivers/gpu/drm/i915/intel_uncore.c b/drivers/gpu/drm/i915/intel_uncore.c
index 4a86cf0..6a19b3e 100644
--- a/drivers/gpu/drm/i915/intel_uncore.c
+++ b/drivers/gpu/drm/i915/intel_uncore.c
@@ -1455,9 +1455,52 @@ static int gen6_do_reset(struct drm_device *dev)
 	return ret;
 }
 
+static int wait_for_register(struct drm_i915_private *dev_priv,
+			     const u32 reg,
+			     const u32 mask,
+			     const u32 value,
+			     const unsigned long timeout)
+{
+	return wait_for((I915_READ(reg) & mask) == value, timeout);
+}
+
+static int gen8_do_reset(struct drm_device *dev)
+{
+	struct drm_i915_private *dev_priv = dev->dev_private;
+	struct intel_engine_cs *engine;
+	int ret, i;
+
+	for_each_ring(engine, dev_priv, i) {
+		I915_WRITE(RING_RESET_CTL(engine->mmio_base),
+			   _MASKED_BIT_ENABLE(RESET_CTL_REQUEST_RESET));
+
+		ret = wait_for_register(dev_priv,
+					RING_RESET_CTL(engine->mmio_base),
+					RESET_CTL_READY_TO_RESET,
+					RESET_CTL_READY_TO_RESET,
+					700);
+		if (ret) {
+			DRM_ERROR("%s: reset request timeout\n", engine->name);
+			ret = -EIO;
+			goto not_ready;
+		}
+	}
+
+	return gen6_do_reset(dev);
+
+not_ready:
+	for_each_ring(engine, dev_priv, i)
+		I915_WRITE(RING_RESET_CTL(engine->mmio_base),
+			   _MASKED_BIT_DISABLE(RESET_CTL_REQUEST_RESET));
+
+	return ret;
+}
+
 static int (*intel_get_gpu_reset(struct drm_device *dev))(struct drm_device *)
 {
-	if (INTEL_INFO(dev)->gen >= 6)
+	if (INTEL_INFO(dev)->gen >= 8)
+		return gen8_do_reset;
+	else if (INTEL_INFO(dev)->gen >= 6)
 		return gen6_do_reset;
 	else if (IS_GEN5(dev))
 		return ironlake_do_reset;
-- 
1.9.1

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH] drm/i915: Reset request handling for gen8+
  2015-06-18  8:36   ` Mika Kuoppala
@ 2015-06-18  8:50     ` Chris Wilson
  2015-06-18  9:51       ` Mika Kuoppala
  0 siblings, 1 reply; 21+ messages in thread
From: Chris Wilson @ 2015-06-18  8:50 UTC (permalink / raw)
  To: Mika Kuoppala; +Cc: Daniel Vetter, intel-gfx

On Thu, Jun 18, 2015 at 11:36:00AM +0300, Mika Kuoppala wrote:
> diff --git a/drivers/gpu/drm/i915/i915_reg.h b/drivers/gpu/drm/i915/i915_reg.h
> index 0b979ad..3684f92 100644
> --- a/drivers/gpu/drm/i915/i915_reg.h
> +++ b/drivers/gpu/drm/i915/i915_reg.h
> @@ -1461,6 +1461,9 @@ enum skl_disp_power_wells {
>  #define RING_MAX_IDLE(base)	((base)+0x54)
>  #define RING_HWS_PGA(base)	((base)+0x80)
>  #define RING_HWS_PGA_GEN6(base)	((base)+0x2080)
> +#define RING_RESET_CTL(base)	((base)+0xd0)
> +#define   RESET_CTL_REQUEST_RESET  (1 << 0)
> +#define   RESET_CTL_READY_TO_RESET (1 << 1)
>  
>  #define HSW_GTT_CACHE_EN	0x4024
>  #define   GTT_CACHE_EN_ALL	0xF0007FFF
> diff --git a/drivers/gpu/drm/i915/intel_uncore.c b/drivers/gpu/drm/i915/intel_uncore.c
> index 4a86cf0..6a19b3e 100644
> --- a/drivers/gpu/drm/i915/intel_uncore.c
> +++ b/drivers/gpu/drm/i915/intel_uncore.c
> @@ -1455,9 +1455,52 @@ static int gen6_do_reset(struct drm_device *dev)
>  	return ret;
>  }
>  
> +static int wait_for_register(struct drm_i915_private *dev_priv,
> +			     const u32 reg,
> +			     const u32 mask,
> +			     const u32 value,
> +			     const unsigned long timeout)

To be overly fussy, timeout_ms.

I like having units for frequently confused variables like timeouts.

(I am being fussy, because I like this function and expect to convert
lots of callsites over to it, so clarity is important.)
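
To illustrate the naming point, here is a minimal userspace sketch of a polling helper whose timeout carries its unit in the name. It is not the i915 code: `struct fake_reg`, `fake_read` and the simulated clock `fake_now_ms` are hypothetical stand-ins for a real MMIO read and for the kernel's `wait_for`, and the 1 ms polling step is an assumption.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for an MMIO register: ready_bit latches once
 * the simulated clock reaches ready_at_ms. */
struct fake_reg {
	uint32_t value;
	unsigned long ready_at_ms;
	uint32_t ready_bit;
};

static unsigned long fake_now_ms; /* simulated clock, in milliseconds */

static uint32_t fake_read(struct fake_reg *reg)
{
	if (fake_now_ms >= reg->ready_at_ms)
		reg->value |= reg->ready_bit;
	return reg->value;
}

/* Poll until (reg & mask) == value or until timeout_ms have elapsed.
 * Returns 0 on success and -ETIMEDOUT on timeout. */
static int wait_for_register(struct fake_reg *reg, uint32_t mask,
			     uint32_t value, unsigned long timeout_ms)
{
	const unsigned long deadline_ms = fake_now_ms + timeout_ms;

	assert(reg != NULL);

	for (;;) {
		if ((fake_read(reg) & mask) == value)
			return 0;
		if (fake_now_ms >= deadline_ms)
			return -ETIMEDOUT;
		fake_now_ms++; /* stands in for sleeping 1 ms */
	}
}
```

With the unit in the parameter name, a call such as `wait_for_register(&reg, mask, mask, 700)` reads unambiguously as a 700 ms wait.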

> +{
> +	return wait_for((I915_READ(reg) & mask) == value, timeout);
> +}
> +
> +static int gen8_do_reset(struct drm_device *dev)
> +{
> +	struct drm_i915_private *dev_priv = dev->dev_private;
> +	struct intel_engine_cs *engine;
> +	int ret, i;
> +
> +	for_each_ring(engine, dev_priv, i) {
> +		I915_WRITE(RING_RESET_CTL(engine->mmio_base),
> +			   _MASKED_BIT_ENABLE(RESET_CTL_REQUEST_RESET));
> +
> +		ret = wait_for_register(dev_priv,
> +					RING_RESET_CTL(engine->mmio_base),
> +					RESET_CTL_READY_TO_RESET,
> +					RESET_CTL_READY_TO_RESET,
> +					700);
> +		if (ret) {

In a similar vein, we ignore ret here so just if (wait_for_register()) {
and not_ready: ...; return -EIO;
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH] drm/i915: Reset request handling for gen8+
  2015-06-18  8:50     ` Chris Wilson
@ 2015-06-18  9:51       ` Mika Kuoppala
  2015-06-18 10:03         ` Chris Wilson
  2015-06-18 10:11         ` Tomas Elf
  0 siblings, 2 replies; 21+ messages in thread
From: Mika Kuoppala @ 2015-06-18  9:51 UTC (permalink / raw)
  To: intel-gfx; +Cc: Daniel Vetter

In order for gen8+ hardware to guarantee that no context switch
takes place during engine reset and that current context is properly
saved, the driver needs to notify and query hw before commencing
with reset.

There are gpu hangs where the engine gets so stuck that it will never
report being ready for reset. We could proceed with reset anyway, but
with some hangs on skl, the forced gpu reset will result in a system
hang. On inspection, unreadiness for reset seems to correlate with
the probable system hang.

We will only proceed with reset if all engines report that they
are ready for reset. If the root cause of the system hang is found and
can be worked around by other means, we can reconsider whether to
reinstate the full reset for the unreadiness case.

v2: -EIO, Recovery, gen8 (Chris, Tomas, Daniel)
v3: updated commit msg
v4: timeout_ms, simpler error path (Chris)

References: https://bugs.freedesktop.org/show_bug.cgi?id=89959
References: https://bugs.freedesktop.org/show_bug.cgi?id=90854
Testcase: igt/gem_concurrent_blit --r prw-blt-overwrite-source-read-rcs-forked
Testcase: igt/gem_concurrent_blit --r gtt-blt-overwrite-source-read-rcs-forked
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Tomas Elf <tomas.elf@intel.com>
Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com>
---
 drivers/gpu/drm/i915/i915_reg.h     |  3 +++
 drivers/gpu/drm/i915/intel_uncore.c | 43 ++++++++++++++++++++++++++++++++++++-
 2 files changed, 45 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/i915/i915_reg.h b/drivers/gpu/drm/i915/i915_reg.h
index 0b979ad..3684f92 100644
--- a/drivers/gpu/drm/i915/i915_reg.h
+++ b/drivers/gpu/drm/i915/i915_reg.h
@@ -1461,6 +1461,9 @@ enum skl_disp_power_wells {
 #define RING_MAX_IDLE(base)	((base)+0x54)
 #define RING_HWS_PGA(base)	((base)+0x80)
 #define RING_HWS_PGA_GEN6(base)	((base)+0x2080)
+#define RING_RESET_CTL(base)	((base)+0xd0)
+#define   RESET_CTL_REQUEST_RESET  (1 << 0)
+#define   RESET_CTL_READY_TO_RESET (1 << 1)
 
 #define HSW_GTT_CACHE_EN	0x4024
 #define   GTT_CACHE_EN_ALL	0xF0007FFF
diff --git a/drivers/gpu/drm/i915/intel_uncore.c b/drivers/gpu/drm/i915/intel_uncore.c
index 4a86cf0..160a47a 100644
--- a/drivers/gpu/drm/i915/intel_uncore.c
+++ b/drivers/gpu/drm/i915/intel_uncore.c
@@ -1455,9 +1455,50 @@ static int gen6_do_reset(struct drm_device *dev)
 	return ret;
 }
 
+static int wait_for_register(struct drm_i915_private *dev_priv,
+			     const u32 reg,
+			     const u32 mask,
+			     const u32 value,
+			     const unsigned long timeout_ms)
+{
+	return wait_for((I915_READ(reg) & mask) == value, timeout_ms);
+}
+
+static int gen8_do_reset(struct drm_device *dev)
+{
+	struct drm_i915_private *dev_priv = dev->dev_private;
+	struct intel_engine_cs *engine;
+	int i;
+
+	for_each_ring(engine, dev_priv, i) {
+		I915_WRITE(RING_RESET_CTL(engine->mmio_base),
+			   _MASKED_BIT_ENABLE(RESET_CTL_REQUEST_RESET));
+
+		if (wait_for_register(dev_priv,
+				      RING_RESET_CTL(engine->mmio_base),
+				      RESET_CTL_READY_TO_RESET,
+				      RESET_CTL_READY_TO_RESET,
+				      700)) {
+			DRM_ERROR("%s: reset request timeout\n", engine->name);
+			goto not_ready;
+		}
+	}
+
+	return gen6_do_reset(dev);
+
+not_ready:
+	for_each_ring(engine, dev_priv, i)
+		I915_WRITE(RING_RESET_CTL(engine->mmio_base),
+			   _MASKED_BIT_DISABLE(RESET_CTL_REQUEST_RESET));
+
+	return -EIO;
+}
+
 static int (*intel_get_gpu_reset(struct drm_device *dev))(struct drm_device *)
 {
-	if (INTEL_INFO(dev)->gen >= 6)
+	if (INTEL_INFO(dev)->gen >= 8)
+		return gen8_do_reset;
+	else if (INTEL_INFO(dev)->gen >= 6)
 		return gen6_do_reset;
 	else if (IS_GEN5(dev))
 		return ironlake_do_reset;
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH] drm/i915: Reset request handling for gen8+
  2015-06-18  9:51       ` Mika Kuoppala
@ 2015-06-18 10:03         ` Chris Wilson
  2015-06-18 10:22           ` Mika Kuoppala
  2015-06-18 10:11         ` Tomas Elf
  1 sibling, 1 reply; 21+ messages in thread
From: Chris Wilson @ 2015-06-18 10:03 UTC (permalink / raw)
  To: Mika Kuoppala; +Cc: Daniel Vetter, intel-gfx

On Thu, Jun 18, 2015 at 12:51:40PM +0300, Mika Kuoppala wrote:
> In order for gen8+ hardware to guarantee that no context switch
> takes place during engine reset and that current context is properly
> saved, the driver needs to notify and query hw before commencing
> with reset.
> 
> There are gpu hangs where the engine gets so stuck that it never will
> report to be ready for reset. We could proceed with reset anyway, but
> with some hangs with skl, the forced gpu reset will result in a system
> hang. By inspecting the unreadiness for reset seems to correlate with
> the probable system hang.
> 
> We will only proceed with reset if all engines report that they
> are ready for reset. If root cause for system hang is found and
> can be worked around with another means, we can reconsider if
> we can reinstate full reset for unreadiness case.
> 
> v2: -EIO, Recovery, gen8 (Chris, Tomas, Daniel)
> v3: updated commit msg
> v4: timeout_ms, simpler error path (Chris)
> 
> References: https://bugs.freedesktop.org/show_bug.cgi?id=89959
> References: https://bugs.freedesktop.org/show_bug.cgi?id=90854
> Testcase: igt/gem_concurrent_blit --r prw-blt-overwrite-source-read-rcs-forked
> Testcase: igt/gem_concurrent_blit --r gtt-blt-overwrite-source-read-rcs-forked

Is this the new format for subtests?

I thought the form was
igt/gem_concurrent_blit/prw-blt-overwrite-source-read-rcs-forked

> Cc: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
> Cc: Tomas Elf <tomas.elf@intel.com>
> Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com>

Lgtm,
Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] drm/i915: Reset request handling for gen8+
  2015-06-18  9:51       ` Mika Kuoppala
  2015-06-18 10:03         ` Chris Wilson
@ 2015-06-18 10:11         ` Tomas Elf
  2015-06-18 10:31           ` Mika Kuoppala
  2015-06-18 10:36           ` Chris Wilson
  1 sibling, 2 replies; 21+ messages in thread
From: Tomas Elf @ 2015-06-18 10:11 UTC (permalink / raw)
  To: Mika Kuoppala, intel-gfx; +Cc: Daniel Vetter

On 18/06/2015 10:51, Mika Kuoppala wrote:
> In order for gen8+ hardware to guarantee that no context switch
> takes place during engine reset and that current context is properly
> saved, the driver needs to notify and query hw before commencing
> with reset.
>
> There are gpu hangs where the engine gets so stuck that it never will
> report to be ready for reset. We could proceed with reset anyway, but
> with some hangs with skl, the forced gpu reset will result in a system
> hang. By inspecting the unreadiness for reset seems to correlate with
> the probable system hang.
>
> We will only proceed with reset if all engines report that they
> are ready for reset. If root cause for system hang is found and
> can be worked around with another means, we can reconsider if
> we can reinstate full reset for unreadiness case.
>
> v2: -EIO, Recovery, gen8 (Chris, Tomas, Daniel)
> v3: updated commit msg
> v4: timeout_ms, simpler error path (Chris)
>
> References: https://bugs.freedesktop.org/show_bug.cgi?id=89959
> References: https://bugs.freedesktop.org/show_bug.cgi?id=90854
> Testcase: igt/gem_concurrent_blit --r prw-blt-overwrite-source-read-rcs-forked
> Testcase: igt/gem_concurrent_blit --r gtt-blt-overwrite-source-read-rcs-forked
> Cc: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
> Cc: Tomas Elf <tomas.elf@intel.com>
> Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com>
> ---
>   drivers/gpu/drm/i915/i915_reg.h     |  3 +++
>   drivers/gpu/drm/i915/intel_uncore.c | 43 ++++++++++++++++++++++++++++++++++++-
>   2 files changed, 45 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/i915/i915_reg.h b/drivers/gpu/drm/i915/i915_reg.h
> index 0b979ad..3684f92 100644
> --- a/drivers/gpu/drm/i915/i915_reg.h
> +++ b/drivers/gpu/drm/i915/i915_reg.h
> @@ -1461,6 +1461,9 @@ enum skl_disp_power_wells {
>   #define RING_MAX_IDLE(base)	((base)+0x54)
>   #define RING_HWS_PGA(base)	((base)+0x80)
>   #define RING_HWS_PGA_GEN6(base)	((base)+0x2080)
> +#define RING_RESET_CTL(base)	((base)+0xd0)
> +#define   RESET_CTL_REQUEST_RESET  (1 << 0)
> +#define   RESET_CTL_READY_TO_RESET (1 << 1)
>
>   #define HSW_GTT_CACHE_EN	0x4024
>   #define   GTT_CACHE_EN_ALL	0xF0007FFF
> diff --git a/drivers/gpu/drm/i915/intel_uncore.c b/drivers/gpu/drm/i915/intel_uncore.c
> index 4a86cf0..160a47a 100644
> --- a/drivers/gpu/drm/i915/intel_uncore.c
> +++ b/drivers/gpu/drm/i915/intel_uncore.c
> @@ -1455,9 +1455,50 @@ static int gen6_do_reset(struct drm_device *dev)
>   	return ret;
>   }
>
> +static int wait_for_register(struct drm_i915_private *dev_priv,
> +			     const u32 reg,
> +			     const u32 mask,
> +			     const u32 value,
> +			     const unsigned long timeout_ms)
> +{
> +	return wait_for((I915_READ(reg) & mask) == value, timeout_ms);
> +}
> +
> +static int gen8_do_reset(struct drm_device *dev)
> +{
> +	struct drm_i915_private *dev_priv = dev->dev_private;
> +	struct intel_engine_cs *engine;
> +	int i;
> +
> +	for_each_ring(engine, dev_priv, i) {
> +		I915_WRITE(RING_RESET_CTL(engine->mmio_base),
> +			   _MASKED_BIT_ENABLE(RESET_CTL_REQUEST_RESET));
> +
> +		if (wait_for_register(dev_priv,
> +				      RING_RESET_CTL(engine->mmio_base),
> +				      RESET_CTL_READY_TO_RESET,
> +				      RESET_CTL_READY_TO_RESET,
> +				      700)) {
> +			DRM_ERROR("%s: reset request timeout\n", engine->name);
> +			goto not_ready;
> +		}

So just to be clear here: If one or more of the reset control registers 
decide that they are at a point where they will never again be ready for 
reset we will simply not do a full GPU reset until reboot? Is there 
perhaps a case where you would want to try reset request once or twice 
or like five times or whatever but then simply go ahead with the full 
GPU reset regardless of what the reset control register tells you? After 
all, it's our only way out if the hardware is truly stuck.

Thanks,
Tomas
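
The policy asked about above (retry the reset request a bounded number of times, then force the full reset) could be sketched roughly as follows. This is purely illustrative and is not what the patch does; the commit message notes that forcing the reset on an unready engine can hang the system on skl. `struct fake_engine`, `request_reset`, `reset_with_retries` and the budget `MAX_RESET_REQUESTS` are all hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_RESET_REQUESTS 5 /* hypothetical retry budget */

/* Hypothetical engine model: reports ready after a given number of
 * requests, or never when ready_after_requests is negative. */
struct fake_engine {
	int ready_after_requests;
	int requests_seen;
};

/* One reset request; returns true if the engine reports ready. */
static bool request_reset(struct fake_engine *e)
{
	e->requests_seen++;
	return e->ready_after_requests >= 0 &&
	       e->requests_seen >= e->ready_after_requests;
}

enum reset_outcome { RESET_CLEAN, RESET_FORCED };

/* Ask up to MAX_RESET_REQUESTS times; if the engine never becomes
 * ready, report that the caller would have to force the reset. */
static enum reset_outcome reset_with_retries(struct fake_engine *e)
{
	assert(e != NULL);

	for (int attempt = 0; attempt < MAX_RESET_REQUESTS; attempt++)
		if (request_reset(e))
			return RESET_CLEAN;

	return RESET_FORCED;
}
```

The `RESET_FORCED` branch is exactly the case the patch avoids by returning -EIO instead.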

> +	}
> +
> +	return gen6_do_reset(dev);
> +
> +not_ready:
> +	for_each_ring(engine, dev_priv, i)
> +		I915_WRITE(RING_RESET_CTL(engine->mmio_base),
> +			   _MASKED_BIT_DISABLE(RESET_CTL_REQUEST_RESET));
> +
> +	return -EIO;
> +}
> +
>   static int (*intel_get_gpu_reset(struct drm_device *dev))(struct drm_device *)
>   {
> -	if (INTEL_INFO(dev)->gen >= 6)
> +	if (INTEL_INFO(dev)->gen >= 8)
> +		return gen8_do_reset;
> +	else if (INTEL_INFO(dev)->gen >= 6)
>   		return gen6_do_reset;
>   	else if (IS_GEN5(dev))
>   		return ironlake_do_reset;
>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] drm/i915: Reset request handling for gen8+
  2015-06-18 10:03         ` Chris Wilson
@ 2015-06-18 10:22           ` Mika Kuoppala
  2015-06-18 15:00             ` Daniel Vetter
  0 siblings, 1 reply; 21+ messages in thread
From: Mika Kuoppala @ 2015-06-18 10:22 UTC (permalink / raw)
  To: Chris Wilson; +Cc: Daniel Vetter, intel-gfx

Chris Wilson <chris@chris-wilson.co.uk> writes:

> On Thu, Jun 18, 2015 at 12:51:40PM +0300, Mika Kuoppala wrote:
>> In order for gen8+ hardware to guarantee that no context switch
>> takes place during engine reset and that current context is properly
>> saved, the driver needs to notify and query hw before commencing
>> with reset.
>> 
>> There are gpu hangs where the engine gets so stuck that it never will
>> report to be ready for reset. We could proceed with reset anyway, but
>> with some hangs with skl, the forced gpu reset will result in a system
>> hang. By inspecting the unreadiness for reset seems to correlate with
>> the probable system hang.
>> 
>> We will only proceed with reset if all engines report that they
>> are ready for reset. If root cause for system hang is found and
>> can be worked around with another means, we can reconsider if
>> we can reinstate full reset for unreadiness case.
>> 
>> v2: -EIO, Recovery, gen8 (Chris, Tomas, Daniel)
>> v3: updated commit msg
>> v4: timeout_ms, simpler error path (Chris)
>> 
>> References: https://bugs.freedesktop.org/show_bug.cgi?id=89959
>> References: https://bugs.freedesktop.org/show_bug.cgi?id=90854
>> Testcase: igt/gem_concurrent_blit --r prw-blt-overwrite-source-read-rcs-forked
>> Testcase: igt/gem_concurrent_blit --r gtt-blt-overwrite-source-read-rcs-forked
>
> Is this the new format for subtests?

No, that is me cut-and-pasting from scripts. Daniel, could you please
fix it while merging?

Thanks,
-Mika

> I thought the form was
> igt/gem_concurrent_blit/prw-blt-overwrite-source-read-rcs-forked
>
>> Cc: Chris Wilson <chris@chris-wilson.co.uk>
>> Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
>> Cc: Tomas Elf <tomas.elf@intel.com>
>> Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com>
>
> Lgtm,
> Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
> -Chris
>
> -- 
> Chris Wilson, Intel Open Source Technology Centre

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] drm/i915: Reset request handling for gen8+
  2015-06-18 10:11         ` Tomas Elf
@ 2015-06-18 10:31           ` Mika Kuoppala
  2015-06-18 10:36           ` Chris Wilson
  1 sibling, 0 replies; 21+ messages in thread
From: Mika Kuoppala @ 2015-06-18 10:31 UTC (permalink / raw)
  To: Tomas Elf, intel-gfx; +Cc: Daniel Vetter

Tomas Elf <tomas.elf@intel.com> writes:

> On 18/06/2015 10:51, Mika Kuoppala wrote:
>> In order for gen8+ hardware to guarantee that no context switch
>> takes place during engine reset and that current context is properly
>> saved, the driver needs to notify and query hw before commencing
>> with reset.
>>
>> There are gpu hangs where the engine gets so stuck that it never will
>> report to be ready for reset. We could proceed with reset anyway, but
>> with some hangs with skl, the forced gpu reset will result in a system
>> hang. By inspecting the unreadiness for reset seems to correlate with
>> the probable system hang.
>>
>> We will only proceed with reset if all engines report that they
>> are ready for reset. If root cause for system hang is found and
>> can be worked around with another means, we can reconsider if
>> we can reinstate full reset for unreadiness case.
>>
>> v2: -EIO, Recovery, gen8 (Chris, Tomas, Daniel)
>> v3: updated commit msg
>> v4: timeout_ms, simpler error path (Chris)
>>
>> References: https://bugs.freedesktop.org/show_bug.cgi?id=89959
>> References: https://bugs.freedesktop.org/show_bug.cgi?id=90854
>> Testcase: igt/gem_concurrent_blit --r prw-blt-overwrite-source-read-rcs-forked
>> Testcase: igt/gem_concurrent_blit --r gtt-blt-overwrite-source-read-rcs-forked
>> Cc: Chris Wilson <chris@chris-wilson.co.uk>
>> Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
>> Cc: Tomas Elf <tomas.elf@intel.com>
>> Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com>
>> ---
>>   drivers/gpu/drm/i915/i915_reg.h     |  3 +++
>>   drivers/gpu/drm/i915/intel_uncore.c | 43 ++++++++++++++++++++++++++++++++++++-
>>   2 files changed, 45 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/gpu/drm/i915/i915_reg.h b/drivers/gpu/drm/i915/i915_reg.h
>> index 0b979ad..3684f92 100644
>> --- a/drivers/gpu/drm/i915/i915_reg.h
>> +++ b/drivers/gpu/drm/i915/i915_reg.h
>> @@ -1461,6 +1461,9 @@ enum skl_disp_power_wells {
>>   #define RING_MAX_IDLE(base)	((base)+0x54)
>>   #define RING_HWS_PGA(base)	((base)+0x80)
>>   #define RING_HWS_PGA_GEN6(base)	((base)+0x2080)
>> +#define RING_RESET_CTL(base)	((base)+0xd0)
>> +#define   RESET_CTL_REQUEST_RESET  (1 << 0)
>> +#define   RESET_CTL_READY_TO_RESET (1 << 1)
>>
>>   #define HSW_GTT_CACHE_EN	0x4024
>>   #define   GTT_CACHE_EN_ALL	0xF0007FFF
>> diff --git a/drivers/gpu/drm/i915/intel_uncore.c b/drivers/gpu/drm/i915/intel_uncore.c
>> index 4a86cf0..160a47a 100644
>> --- a/drivers/gpu/drm/i915/intel_uncore.c
>> +++ b/drivers/gpu/drm/i915/intel_uncore.c
>> @@ -1455,9 +1455,50 @@ static int gen6_do_reset(struct drm_device *dev)
>>   	return ret;
>>   }
>>
>> +static int wait_for_register(struct drm_i915_private *dev_priv,
>> +			     const u32 reg,
>> +			     const u32 mask,
>> +			     const u32 value,
>> +			     const unsigned long timeout_ms)
>> +{
>> +	return wait_for((I915_READ(reg) & mask) == value, timeout_ms);
>> +}
>> +
>> +static int gen8_do_reset(struct drm_device *dev)
>> +{
>> +	struct drm_i915_private *dev_priv = dev->dev_private;
>> +	struct intel_engine_cs *engine;
>> +	int i;
>> +
>> +	for_each_ring(engine, dev_priv, i) {
>> +		I915_WRITE(RING_RESET_CTL(engine->mmio_base),
>> +			   _MASKED_BIT_ENABLE(RESET_CTL_REQUEST_RESET));
>> +
>> +		if (wait_for_register(dev_priv,
>> +				      RING_RESET_CTL(engine->mmio_base),
>> +				      RESET_CTL_READY_TO_RESET,
>> +				      RESET_CTL_READY_TO_RESET,
>> +				      700)) {
>> +			DRM_ERROR("%s: reset request timeout\n", engine->name);
>> +			goto not_ready;
>> +		}
>
> So just to be clear here: If one or more of the reset control registers 
> decide that they are at a point where they will never again be ready for 
> reset we will simply not do a full GPU reset until reboot? 

Correct. At least for now, until we find out what upsets the engine
so much that resetting it hangs the system. So for now it is just
a choice between a dead gpu and a dead system.

> Is there
> perhaps a case where you would want to try reset request once or twice 
> or like five times or whatever but then simply go ahead with the full 
> GPU reset regardless of what the reset control register tells you? After 
> all, it's our only way out if the hardware is truly stuck.
>

That would be best if we could count on the reset only resetting
the GPU. Then we would risk just losing/messing up the context (and
only with per-ring resets).

But until we learn more of this situation, we risk hanging the
whole system by trying to revive the gpu. I tried to update
the commit message to reflect this.

-Mika

> Thanks,
> Tomas
>
>> +	}
>> +
>> +	return gen6_do_reset(dev);
>> +
>> +not_ready:
>> +	for_each_ring(engine, dev_priv, i)
>> +		I915_WRITE(RING_RESET_CTL(engine->mmio_base),
>> +			   _MASKED_BIT_DISABLE(RESET_CTL_REQUEST_RESET));
>> +
>> +	return -EIO;
>> +}
>> +
>>   static int (*intel_get_gpu_reset(struct drm_device *dev))(struct drm_device *)
>>   {
>> -	if (INTEL_INFO(dev)->gen >= 6)
>> +	if (INTEL_INFO(dev)->gen >= 8)
>> +		return gen8_do_reset;
>> +	else if (INTEL_INFO(dev)->gen >= 6)
>>   		return gen6_do_reset;
>>   	else if (IS_GEN5(dev))
>>   		return ironlake_do_reset;
>>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] drm/i915: Reset request handling for gen8+
  2015-06-18 10:11         ` Tomas Elf
  2015-06-18 10:31           ` Mika Kuoppala
@ 2015-06-18 10:36           ` Chris Wilson
  2015-06-18 11:18             ` Tomas Elf
  1 sibling, 1 reply; 21+ messages in thread
From: Chris Wilson @ 2015-06-18 10:36 UTC (permalink / raw)
  To: Tomas Elf; +Cc: intel-gfx, Daniel Vetter

On Thu, Jun 18, 2015 at 11:11:55AM +0100, Tomas Elf wrote:
> On 18/06/2015 10:51, Mika Kuoppala wrote:
> >In order for gen8+ hardware to guarantee that no context switch
> >takes place during engine reset and that current context is properly
> >saved, the driver needs to notify and query hw before commencing
> >with reset.
> >
> >There are gpu hangs where the engine gets so stuck that it never will
> >report to be ready for reset. We could proceed with reset anyway, but
> >with some hangs with skl, the forced gpu reset will result in a system
> >hang. By inspecting the unreadiness for reset seems to correlate with
> >the probable system hang.
> >
> >We will only proceed with reset if all engines report that they
> >are ready for reset. If root cause for system hang is found and
> >can be worked around with another means, we can reconsider if
> >we can reinstate full reset for unreadiness case.
> >
> >v2: -EIO, Recovery, gen8 (Chris, Tomas, Daniel)
> >v3: updated commit msg
> >v4: timeout_ms, simpler error path (Chris)
> >
> >References: https://bugs.freedesktop.org/show_bug.cgi?id=89959
> >References: https://bugs.freedesktop.org/show_bug.cgi?id=90854
> >Testcase: igt/gem_concurrent_blit --r prw-blt-overwrite-source-read-rcs-forked
> >Testcase: igt/gem_concurrent_blit --r gtt-blt-overwrite-source-read-rcs-forked
> >Cc: Chris Wilson <chris@chris-wilson.co.uk>
> >Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
> >Cc: Tomas Elf <tomas.elf@intel.com>
> >Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com>
> >---
> >  drivers/gpu/drm/i915/i915_reg.h     |  3 +++
> >  drivers/gpu/drm/i915/intel_uncore.c | 43 ++++++++++++++++++++++++++++++++++++-
> >  2 files changed, 45 insertions(+), 1 deletion(-)
> >
> >diff --git a/drivers/gpu/drm/i915/i915_reg.h b/drivers/gpu/drm/i915/i915_reg.h
> >index 0b979ad..3684f92 100644
> >--- a/drivers/gpu/drm/i915/i915_reg.h
> >+++ b/drivers/gpu/drm/i915/i915_reg.h
> >@@ -1461,6 +1461,9 @@ enum skl_disp_power_wells {
> >  #define RING_MAX_IDLE(base)	((base)+0x54)
> >  #define RING_HWS_PGA(base)	((base)+0x80)
> >  #define RING_HWS_PGA_GEN6(base)	((base)+0x2080)
> >+#define RING_RESET_CTL(base)	((base)+0xd0)
> >+#define   RESET_CTL_REQUEST_RESET  (1 << 0)
> >+#define   RESET_CTL_READY_TO_RESET (1 << 1)
> >
> >  #define HSW_GTT_CACHE_EN	0x4024
> >  #define   GTT_CACHE_EN_ALL	0xF0007FFF
> >diff --git a/drivers/gpu/drm/i915/intel_uncore.c b/drivers/gpu/drm/i915/intel_uncore.c
> >index 4a86cf0..160a47a 100644
> >--- a/drivers/gpu/drm/i915/intel_uncore.c
> >+++ b/drivers/gpu/drm/i915/intel_uncore.c
> >@@ -1455,9 +1455,50 @@ static int gen6_do_reset(struct drm_device *dev)
> >  	return ret;
> >  }
> >
> >+static int wait_for_register(struct drm_i915_private *dev_priv,
> >+			     const u32 reg,
> >+			     const u32 mask,
> >+			     const u32 value,
> >+			     const unsigned long timeout_ms)
> >+{
> >+	return wait_for((I915_READ(reg) & mask) == value, timeout_ms);
> >+}
> >+
> >+static int gen8_do_reset(struct drm_device *dev)
> >+{
> >+	struct drm_i915_private *dev_priv = dev->dev_private;
> >+	struct intel_engine_cs *engine;
> >+	int i;
> >+
> >+	for_each_ring(engine, dev_priv, i) {
> >+		I915_WRITE(RING_RESET_CTL(engine->mmio_base),
> >+			   _MASKED_BIT_ENABLE(RESET_CTL_REQUEST_RESET));
> >+
> >+		if (wait_for_register(dev_priv,
> >+				      RING_RESET_CTL(engine->mmio_base),
> >+				      RESET_CTL_READY_TO_RESET,
> >+				      RESET_CTL_READY_TO_RESET,
> >+				      700)) {
> >+			DRM_ERROR("%s: reset request timeout\n", engine->name);
> >+			goto not_ready;
> >+		}
> 
> So just to be clear here: If one or more of the reset control
> registers decide that they are at a point where they will never
> again be ready for reset we will simply not do a full GPU reset
> until reboot? Is there perhaps a case where you would want to try
> reset request once or twice or like five times or whatever but then
> simply go ahead with the full GPU reset regardless of what the reset
> control register tells you? After all, it's our only way out if the
> hardware is truly stuck.

What happens is that we skip the reset, report an error and that marks
the GPU as wedged. To get out of that state requires user intervention,
either by rebooting or through use of debugfs/i915_wedged.

We can try to repeat the reset from a workqueue, but we should first
tackle the interaction with TDR and get your per-engine reset
upstream, along with its various levels of backoff and recovery.
-Chris
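
The behaviour described above (skip the reset, report an error, leave the GPU marked as wedged) can be modelled in a few lines of userspace C. This is a sketch under stated assumptions, not the driver's real state machine: `struct fake_gpu`, `fake_do_reset` and `fake_handle_hang` are hypothetical names.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the driver-side state; not the i915 structures. */
struct fake_gpu {
	bool engines_ready; /* would every engine acknowledge a reset request? */
	bool wedged;        /* set once a reset attempt has failed */
};

/* Mirrors the patch's policy: refuse to reset unless all engines are ready. */
static int fake_do_reset(const struct fake_gpu *gpu)
{
	return gpu->engines_ready ? 0 : -EIO;
}

/* Hang handler: a failed reset leaves the GPU marked as wedged.
 * Nothing in this path clears the flag again; per the discussion,
 * that takes user intervention. */
static void fake_handle_hang(struct fake_gpu *gpu)
{
	assert(gpu != NULL);

	if (fake_do_reset(gpu))
		gpu->wedged = true;
}
```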

-- 
Chris Wilson, Intel Open Source Technology Centre

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] drm/i915: Reset request handling for gen8+
  2015-06-18 10:36           ` Chris Wilson
@ 2015-06-18 11:18             ` Tomas Elf
  2015-06-18 11:42               ` Chris Wilson
  0 siblings, 1 reply; 21+ messages in thread
From: Tomas Elf @ 2015-06-18 11:18 UTC (permalink / raw)
  To: Chris Wilson, Mika Kuoppala, intel-gfx, Daniel Vetter

On 18/06/2015 11:36, Chris Wilson wrote:
 > On Thu, Jun 18, 2015 at 11:11:55AM +0100, Tomas Elf wrote:
 >> On 18/06/2015 10:51, Mika Kuoppala wrote:
 >>> In order for gen8+ hardware to guarantee that no context switch
 >>> takes place during engine reset and that current context is properly
 >>> saved, the driver needs to notify and query hw before commencing
 >>> with reset.
 >>>
 >>> There are gpu hangs where the engine gets so stuck that it never will
 >>> report to be ready for reset. We could proceed with reset anyway, but
 >>> with some hangs with skl, the forced gpu reset will result in a system
 >>> hang. By inspecting the unreadiness for reset seems to correlate with
 >>> the probable system hang.
 >>>
 >>> We will only proceed with reset if all engines report that they
 >>> are ready for reset. If root cause for system hang is found and
 >>> can be worked around with another means, we can reconsider if
 >>> we can reinstate full reset for unreadiness case.
 >>>
 >>> v2: -EIO, Recovery, gen8 (Chris, Tomas, Daniel)
 >>> v3: updated commit msg
 >>> v4: timeout_ms, simpler error path (Chris)
 >>>
 >>> References: https://bugs.freedesktop.org/show_bug.cgi?id=89959
 >>> References: https://bugs.freedesktop.org/show_bug.cgi?id=90854
 >>> Testcase: igt/gem_concurrent_blit --r prw-blt-overwrite-source-read-rcs-forked
 >>> Testcase: igt/gem_concurrent_blit --r gtt-blt-overwrite-source-read-rcs-forked
 >>> Cc: Chris Wilson <chris@chris-wilson.co.uk>
 >>> Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
 >>> Cc: Tomas Elf <tomas.elf@intel.com>
 >>> Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com>
 >>> ---
 >>>   drivers/gpu/drm/i915/i915_reg.h     |  3 +++
 >>>   drivers/gpu/drm/i915/intel_uncore.c | 43 ++++++++++++++++++++++++++++++++++++-
 >>>   2 files changed, 45 insertions(+), 1 deletion(-)
 >>>
 >>> diff --git a/drivers/gpu/drm/i915/i915_reg.h b/drivers/gpu/drm/i915/i915_reg.h
 >>> index 0b979ad..3684f92 100644
 >>> --- a/drivers/gpu/drm/i915/i915_reg.h
 >>> +++ b/drivers/gpu/drm/i915/i915_reg.h
 >>> @@ -1461,6 +1461,9 @@ enum skl_disp_power_wells {
 >>>   #define RING_MAX_IDLE(base)	((base)+0x54)
 >>>   #define RING_HWS_PGA(base)	((base)+0x80)
 >>>   #define RING_HWS_PGA_GEN6(base)	((base)+0x2080)
 >>> +#define RING_RESET_CTL(base)	((base)+0xd0)
 >>> +#define   RESET_CTL_REQUEST_RESET  (1 << 0)
 >>> +#define   RESET_CTL_READY_TO_RESET (1 << 1)
 >>>
 >>>   #define HSW_GTT_CACHE_EN	0x4024
 >>>   #define   GTT_CACHE_EN_ALL	0xF0007FFF
 >>> diff --git a/drivers/gpu/drm/i915/intel_uncore.c b/drivers/gpu/drm/i915/intel_uncore.c
 >>> index 4a86cf0..160a47a 100644
 >>> --- a/drivers/gpu/drm/i915/intel_uncore.c
 >>> +++ b/drivers/gpu/drm/i915/intel_uncore.c
 >>> @@ -1455,9 +1455,50 @@ static int gen6_do_reset(struct drm_device *dev)
 >>>   	return ret;
 >>>   }
 >>>
 >>> +static int wait_for_register(struct drm_i915_private *dev_priv,
 >>> +			     const u32 reg,
 >>> +			     const u32 mask,
 >>> +			     const u32 value,
 >>> +			     const unsigned long timeout_ms)
 >>> +{
 >>> +	return wait_for((I915_READ(reg) & mask) == value, timeout_ms);
 >>> +}
 >>> +
 >>> +static int gen8_do_reset(struct drm_device *dev)
 >>> +{
 >>> +	struct drm_i915_private *dev_priv = dev->dev_private;
 >>> +	struct intel_engine_cs *engine;
 >>> +	int i;
 >>> +
 >>> +	for_each_ring(engine, dev_priv, i) {
 >>> +		I915_WRITE(RING_RESET_CTL(engine->mmio_base),
 >>> +			   _MASKED_BIT_ENABLE(RESET_CTL_REQUEST_RESET));
 >>> +
 >>> +		if (wait_for_register(dev_priv,
 >>> +				      RING_RESET_CTL(engine->mmio_base),
 >>> +				      RESET_CTL_READY_TO_RESET,
 >>> +				      RESET_CTL_READY_TO_RESET,
 >>> +				      700)) {
 >>> +			DRM_ERROR("%s: reset request timeout\n", engine->name);
 >>> +			goto not_ready;
 >>> +		}
 >>
 >> So just to be clear here: If one or more of the reset control
 >> registers decide that they are at a point where they will never
 >> again be ready for reset we will simply not do a full GPU reset
 >> until reboot? Is there perhaps a case where you would want to try
 >> reset request once or twice or like five times or whatever but then
 >> simply go ahead with the full GPU reset regardless of what the reset
 >> control register tells you? After all, it's our only way out if the
 >> hardware is truly stuck.
 >
 > What happens is that we skip the reset, report an error and that marks
 > the GPU as wedged. To get out of that state requires user intervention,
 > either by rebooting or through use of debugfs/i915_wedged.

That's a fair point, we will mark the GPU as terminally wedged. That's 
always been there as a final state where we simply give up. I guess it 
might be better to actively mark the GPU as terminally wedged from the 
driver's point of view rather than plow ahead in a last ditch effort to 
reset the GPU, which may or may not succeed and which may irrecoverably 
hang the system in the worst case. I guess we at least protect the 
currently running context if we just mark the GPU as terminally wedged 
instead of putting it in a potentially undefined state.
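
To make the policy discussed above concrete, here is a minimal userspace
sketch, not the driver code. It assumes a hypothetical fake_engine and
fake_gpu model in which ready_to_reset stands in for what the hardware
reports through RESET_CTL_READY_TO_RESET after a reset request. In the
real driver the caller marks the GPU as wedged; the sketch folds that
into one function for brevity.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for an engine; not the driver's intel_engine_cs. */
struct fake_engine {
	const char *name;
	bool ready_to_reset;	/* models RESET_CTL_READY_TO_RESET */
};

/* Hypothetical stand-in for the device state. */
struct fake_gpu {
	struct fake_engine *engines;
	size_t num_engines;
	bool wedged;
	unsigned int resets_done;
};

/* Ask every engine for permission; reset only if all of them agree. */
static int fake_gpu_reset(struct fake_gpu *gpu)
{
	size_t i;

	assert(gpu != NULL);

	for (i = 0; i < gpu->num_engines; i++) {
		if (!gpu->engines[i].ready_to_reset) {
			/* Skip the reset entirely and give up on the GPU. */
			gpu->wedged = true;
			return -EIO;
		}
	}

	/* All engines agreed, so the full reset may go ahead. */
	gpu->resets_done++;
	return 0;
}
```

A single unready engine is enough to skip the reset for the whole
device, which is the behaviour being debated in this thread.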

 >
 > We can try to repeat the reset from a workqueue, but we should first
 > tackle interaction with TDR and get your per-engine reset
 > upstream, along with its various levels of backoff and recovery.
 > -Chris

My point was more along the lines of bailing out if the reset request 
fails and not return an error message but simply keep track of the 
number of times we've attempted the reset request. By not returning an 
error we would allow more subsequent hang detections to happen (since 
the hang is still there), which would end up in the same reset request 
in the future. If the reset request would fail more times we would 
simply increment the counter and at one point we would decide that we've 
had too many unsuccessful reset request attempts and simply go ahead 
with the reset anyway and if the reset would fail we would return an 
error at that point in time, which would result in a terminally wedged 
state. But, yeah, I can see why we shouldn't do this.
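
The counter idea described above could be sketched roughly as follows.
This is a userspace illustration only: the names handle_hang,
reset_tracker and MAX_FAILED_RESET_REQUESTS, and the limit of 5, are
assumptions made for the example and do not exist in the driver. The
two boolean arguments simulate what the hardware would do.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical limit; the thread only says "five times or whatever". */
#define MAX_FAILED_RESET_REQUESTS 5

enum hang_outcome {
	HANG_RESET_DONE,	/* a reset ran and succeeded */
	HANG_RETRY_LATER,	/* no error reported; next hang detection retries */
	HANG_WEDGED,		/* forced reset failed; terminally wedged */
};

struct reset_tracker {
	unsigned int failed_requests;
};

/*
 * engines_ready:      simulated result of the reset request handshake.
 * forced_reset_works: simulated result of resetting without permission;
 *                     only consulted once the attempt limit is reached.
 */
static enum hang_outcome handle_hang(struct reset_tracker *t,
				     bool engines_ready,
				     bool forced_reset_works)
{
	assert(t != NULL);

	if (engines_ready) {
		/* Normal path; the sketch assumes this reset succeeds. */
		t->failed_requests = 0;
		return HANG_RESET_DONE;
	}

	/* Request refused: count it, but do not report an error yet. */
	if (++t->failed_requests < MAX_FAILED_RESET_REQUESTS)
		return HANG_RETRY_LATER;

	/* Too many refusals: go ahead with the reset regardless. */
	t->failed_requests = 0;
	return forced_reset_works ? HANG_RESET_DONE : HANG_WEDGED;
}
```

The forced path is the one that can take the machine down on skl, which
is why it only runs after the limit is reached.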

We could certainly introduce per-engine reset support into this to add 
more levels of recovery and fall-back but in the end if we use reset 
handshaking for both per-engine reset and for full GPU reset and if 
reset handshaking fails in both cases then we're screwed no matter what 
(so we try engine reset request and fail, then fall back to full GPU 
reset request and fail there too - terminally wedged!). The reset 
request failure will block both per-engine reset and full GPU reset and 
result in a terminally wedged state no matter what.

The only thing we gain in this particular case by adding per-engine 
reset support is if the reset request failure is limited to the blitter 
engine (which Ben Widawsky seems to be questioning on IRC). In that 
case, per-engine reset support would allow us to unblock 
other engines separately without touching full GPU reset and thereby not 
having to request blitter engine reset, avoiding the potential case of 
having the blitter engine reset request fail, which would thereby block 
any other hang recovery for all engines.
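
The difference between the two levels can be illustrated with a small
userspace sketch. It assumes a hypothetical engine_state model in which
a per-engine reset only needs the target engine to acknowledge the
reset request, while a full GPU reset needs every engine to do so. None
of these names come from the driver.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-engine model; not a driver structure. */
struct engine_state {
	bool hung;
	bool ready_to_reset;	/* would this engine acknowledge a request? */
};

/* Per-engine reset: only the engine being reset has to agree. */
static bool engine_reset_allowed(const struct engine_state *e, size_t n,
				 size_t idx)
{
	assert(e != NULL && idx < n);
	return e[idx].ready_to_reset;
}

/* Full GPU reset: every engine has to agree, hung or not. */
static bool full_reset_allowed(const struct engine_state *e, size_t n)
{
	size_t i;

	assert(e != NULL);

	for (i = 0; i < n; i++) {
		if (!e[i].ready_to_reset)
			return false;
	}
	return true;
}
```

With a hung render engine that acknowledges the request and a blitter
that does not, engine_reset_allowed() passes for the render engine
while full_reset_allowed() fails. If the hung engine itself refuses,
both checks fail and the only outcome left is the wedged state.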

Anyway, if we prefer the terminally wedged state rather than a last 
ditch attempt at a full GPU reset then I can understand how this makes 
sense.

Thanks,
Tomas


 >

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] drm/i915: Reset request handling for gen8+
  2015-06-18 11:18             ` Tomas Elf
@ 2015-06-18 11:42               ` Chris Wilson
  2015-06-18 14:58                 ` Daniel Vetter
  0 siblings, 1 reply; 21+ messages in thread
From: Chris Wilson @ 2015-06-18 11:42 UTC (permalink / raw)
  To: Tomas Elf; +Cc: intel-gfx, Daniel Vetter

On Thu, Jun 18, 2015 at 12:18:39PM +0100, Tomas Elf wrote:
> My point was more along the lines of bailing out if the reset
> request fails and not return an error message but simply keep track
> of the number of times we've attempted the reset request. By not
> returning an error we would allow more subsequent hang detections to
> happen (since the hang is still there), which would end up in the
> same reset request in the future. If the reset request would fail
> more times we would simply increment the counter and at one point we
> would decide that we've had too many unsuccessful reset request
> attempts and simply go ahead with the reset anyway and if the reset
> would fail we would return an error at that point in time, which
> would result in a terminally wedged state. But, yeah, I can see why
> we shouldn't do this.

Skipping to the middle!

I understand the merit in trying the reset a few times before giving up,
it would just need a bit of restructuring to try the reset before
clearing gem state (trivial) and requeueing the hangcheck. I am just
wary of feature creep before we get stuck into TDR, which promises to
change how we think about resets entirely.

I am trying not to block your work by doing "it would be nice if" tasks
first! :)
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] drm/i915: Reset request handling for gen8+
  2015-06-18 11:42               ` Chris Wilson
@ 2015-06-18 14:58                 ` Daniel Vetter
  2015-06-19 16:30                   ` Chris Wilson
  0 siblings, 1 reply; 21+ messages in thread
From: Daniel Vetter @ 2015-06-18 14:58 UTC (permalink / raw)
  To: Chris Wilson, Tomas Elf, Mika Kuoppala, intel-gfx, Daniel Vetter

On Thu, Jun 18, 2015 at 12:42:55PM +0100, Chris Wilson wrote:
> On Thu, Jun 18, 2015 at 12:18:39PM +0100, Tomas Elf wrote:
> > My point was more along the lines of bailing out if the reset
> > request fails and not return an error message but simply keep track
> > of the number of times we've attempted the reset request. By not
> > returning an error we would allow more subsequent hang detections to
> > happen (since the hang is still there), which would end up in the
> > same reset request in the future. If the reset request would fail
> > more times we would simply increment the counter and at one point we
> > would decide that we've had too many unsuccessful reset request
> > attempts and simply go ahead with the reset anyway and if the reset
> > would fail we would return an error at that point in time, which
> > would result in a terminally wedged state. But, yeah, I can see why
> > we shouldn't do this.
> 
> Skipping to the middle!
> 
> I understand the merit in trying the reset a few times before giving up,
> it would just need a bit of restructuring to try the reset before
> clearing gem state (trivial) and requeueing the hangcheck. I am just
> wary of feature creep before we get stuck into TDR, which promises to
> change how we think about resets entirely.

My maintainer concern here is always that we should err on the side of not
killing the machine. If the reset failed, or if the gpu reinit failed then
marking the gpu as wedged has historically been the safe option. The
system will still run, display mostly works and there's a reasonable
chance you can gather debug data.

We do have i915.reset to disable the reset for these cases, but it's
always a nuisance to have to resort to that.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] drm/i915: Reset request handling for gen8+
  2015-06-18 10:22           ` Mika Kuoppala
@ 2015-06-18 15:00             ` Daniel Vetter
  0 siblings, 0 replies; 21+ messages in thread
From: Daniel Vetter @ 2015-06-18 15:00 UTC (permalink / raw)
  To: Mika Kuoppala; +Cc: Daniel Vetter, intel-gfx

On Thu, Jun 18, 2015 at 01:22:36PM +0300, Mika Kuoppala wrote:
> Chris Wilson <chris@chris-wilson.co.uk> writes:
> 
> > On Thu, Jun 18, 2015 at 12:51:40PM +0300, Mika Kuoppala wrote:
> >> In order for gen8+ hardware to guarantee that no context switch
> >> takes place during engine reset and that current context is properly
> >> saved, the driver needs to notify and query hw before commencing
> >> with reset.
> >> 
> >> There are gpu hangs where the engine gets so stuck that it never will
> >> report to be ready for reset. We could proceed with reset anyway, but
> >> with some hangs with skl, the forced gpu reset will result in a system
> >> hang. By inspecting the unreadiness for reset seems to correlate with
> >> the probable system hang.
> >> 
> >> We will only proceed with reset if all engines report that they
> >> are ready for reset. If root cause for system hang is found and
> >> can be worked around with another means, we can reconsider if
> >> we can reinstate full reset for unreadiness case.
> >> 
> >> v2: -EIO, Recovery, gen8 (Chris, Tomas, Daniel)
> >> v3: updated commit msg
> >> v4: timeout_ms, simpler error path (Chris)
> >> 
> >> References: https://bugs.freedesktop.org/show_bug.cgi?id=89959
> >> References: https://bugs.freedesktop.org/show_bug.cgi?id=90854
> >> Testcase: igt/gem_concurrent_blit --r prw-blt-overwrite-source-read-rcs-forked
> >> Testcase: igt/gem_concurrent_blit --r gtt-blt-overwrite-source-read-rcs-forked
> >
> > Is this the new format for subtests?
> 
> No. It is me cutpasting from scripts. Daniel could you please
> fix while merging.

Done and queued for -next, thanks for the patch.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] drm/i915: Reset request handling for gen8+
  2015-06-18 14:58                 ` Daniel Vetter
@ 2015-06-19 16:30                   ` Chris Wilson
  2015-06-22 12:50                     ` Daniel Vetter
  0 siblings, 1 reply; 21+ messages in thread
From: Chris Wilson @ 2015-06-19 16:30 UTC (permalink / raw)
  To: Daniel Vetter; +Cc: Daniel Vetter, intel-gfx

On Thu, Jun 18, 2015 at 04:58:06PM +0200, Daniel Vetter wrote:
> On Thu, Jun 18, 2015 at 12:42:55PM +0100, Chris Wilson wrote:
> > I understand the merit in trying the reset a few times before giving up,
> > it would just need a bit of restructuring to try the reset before
> > clearing gem state (trivial) and requeueing the hangcheck. I am just
> > wary of feature creep before we get stuck into TDR, which promises to
> > change how we think about resets entirely.
> 
> My maintainer concern here is always that we should err on the side of not
> killing the machine. If the reset failed, or if the gpu reinit failed then
> marking the gpu as wedged has historically been the safe option. The
> system will still run, display mostly works and there's a reasonable
> chance you can gather debug data.

One thing to bear in mind here is that with this particular "don't
reset if not ready" logic, repeating the attempt at reset after another
hangcheck is equivalent to just using a slower hangcheck. (more or less,
a couple of writes to one register difference) So it is no more likely
to hang the machine than the original GPU hang.

We can differentiate the cases here, between say EBUSY, ENODEV, and EIO,
from the actual reset request to determine which we want to retry
(i.e. EBUSY).
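
As a rough sketch of that classification (userspace C; the helper name
reset_error_is_retryable is hypothetical, and the meaning attached to
each errno in the comments is an assumption, since the thread does not
spell it out):

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Decide whether a failed reset attempt is worth repeating later. */
static bool reset_error_is_retryable(int err)
{
	/* Kernel-style convention: 0 on success, negative errno on failure. */
	assert(err <= 0);

	switch (err) {
	case -EBUSY:	/* assumed: engines not ready for reset yet */
		return true;
	case -ENODEV:	/* assumed: no reset support on this device */
	case -EIO:	/* assumed: the reset itself was tried and failed */
	default:
		return false;
	}
}
```

Only the "not ready yet" case would be requeued after another
hangcheck; everything else stays a terminal failure.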
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] drm/i915: Reset request handling for gen8+
  2015-06-19 16:30                   ` Chris Wilson
@ 2015-06-22 12:50                     ` Daniel Vetter
  0 siblings, 0 replies; 21+ messages in thread
From: Daniel Vetter @ 2015-06-22 12:50 UTC (permalink / raw)
  To: Chris Wilson, Daniel Vetter, Tomas Elf, Mika Kuoppala, intel-gfx,
	Daniel Vetter

On Fri, Jun 19, 2015 at 05:30:45PM +0100, Chris Wilson wrote:
> On Thu, Jun 18, 2015 at 04:58:06PM +0200, Daniel Vetter wrote:
> > On Thu, Jun 18, 2015 at 12:42:55PM +0100, Chris Wilson wrote:
> > > I understand the merit in trying the reset a few times before giving up,
> > > it would just need a bit of restructuring to try the reset before
> > > clearing gem state (trivial) and requeueing the hangcheck. I am just
> > > wary of feature creep before we get stuck into TDR, which promises to
> > > change how we think about resets entirely.
> > 
> > My maintainer concern here is always that we should err on the side of not
> > killing the machine. If the reset failed, or if the gpu reinit failed then
> > marking the gpu as wedged has historically been the safe option. The
> > system will still run, display mostly works and there's a reasonable
> > chance you can gather debug data.
> 
> One thing to bear in mind here is that with this particular "don't
> reset if not ready" logic, repeating the attempt at reset after another
> hangcheck is equivalent to just using a slower hangcheck. (more or less,
> a couple of writes to one register difference) So it is no more likely
> to hang the machine than the original GPU hang.
> 
> We can differentiate the cases here, between say EBUSY, ENODEV, and EIO,
> from the actual reset request to determine which we want to retry
> (i.e. EBUSY).

Tbh I don't want to make the reset code too clever with multiple fallback
paths - it's really tricky code and as-is already suffers from imo
insufficient test coverage and too many bugs. Once we decided that the gpu
is dead and return -EIO this should be a terminal state. Developers can
always manually unwedge through debugfs, but for users it's imo paramount
that we don't automatically run some little-tested path and take down
their box in the process.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2015-06-22 12:47 UTC | newest]

Thread overview: 21+ messages
2015-06-16 13:39 [PATCH 1/1] drm/i915: Reset request handling for gen9+ Mika Kuoppala
2015-06-16 14:09 ` Chris Wilson
2015-06-16 17:10 ` Chris Wilson
2015-06-16 20:15   ` Tomas Elf
2015-06-17  6:33     ` Mika Kuoppala
2015-06-16 19:57 ` Tomas Elf
2015-06-17 12:35 ` [PATCH] drm/i915: Reset request handling for gen8+ Mika Kuoppala
2015-06-18  8:36   ` Mika Kuoppala
2015-06-18  8:50     ` Chris Wilson
2015-06-18  9:51       ` Mika Kuoppala
2015-06-18 10:03         ` Chris Wilson
2015-06-18 10:22           ` Mika Kuoppala
2015-06-18 15:00             ` Daniel Vetter
2015-06-18 10:11         ` Tomas Elf
2015-06-18 10:31           ` Mika Kuoppala
2015-06-18 10:36           ` Chris Wilson
2015-06-18 11:18             ` Tomas Elf
2015-06-18 11:42               ` Chris Wilson
2015-06-18 14:58                 ` Daniel Vetter
2015-06-19 16:30                   ` Chris Wilson
2015-06-22 12:50                     ` Daniel Vetter
