Re: [PATCH] drm/i915: Reset request handling for gen8+

From: Mika Kuoppala <mika.kuoppala@linux.intel.com>
To: Tomas Elf <tomas.elf@intel.com>, intel-gfx@lists.freedesktop.org
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Subject: Re: [PATCH] drm/i915: Reset request handling for gen8+
Date: Thu, 18 Jun 2015 13:31:42 +0300
Message-ID: <87vbelpasx.fsf@gaia.fi.intel.com>
In-Reply-To: <5582996B.70606@intel.com>

Tomas Elf <tomas.elf@intel.com> writes:

> On 18/06/2015 10:51, Mika Kuoppala wrote:
>> In order for gen8+ hardware to guarantee that no context switch
>> takes place during engine reset and that current context is properly
>> saved, the driver needs to notify and query hw before commencing
>> with reset.
>>
>> There are gpu hangs where the engine gets so stuck that it will never
>> report being ready for reset. We could proceed with reset anyway, but
>> with some hangs on skl, the forced gpu reset results in a system
>> hang. On inspection, the unreadiness for reset seems to correlate with
>> the probable system hang.
>>
>> We will only proceed with reset if all engines report that they
>> are ready for reset. If the root cause of the system hang is found and
>> can be worked around by other means, we can reconsider whether to
>> reinstate full reset for the unreadiness case.
>>
>> v2: -EIO, Recovery, gen8 (Chris, Tomas, Daniel)
>> v3: updated commit msg
>> v4: timeout_ms, simpler error path (Chris)
>>
>> References: https://bugs.freedesktop.org/show_bug.cgi?id=89959
>> References: https://bugs.freedesktop.org/show_bug.cgi?id=90854
>> Testcase: igt/gem_concurrent_blit --r prw-blt-overwrite-source-read-rcs-forked
>> Testcase: igt/gem_concurrent_blit --r gtt-blt-overwrite-source-read-rcs-forked
>> Cc: Chris Wilson <chris@chris-wilson.co.uk>
>> Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
>> Cc: Tomas Elf <tomas.elf@intel.com>
>> Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com>
>> ---
>>   drivers/gpu/drm/i915/i915_reg.h     |  3 +++
>>   drivers/gpu/drm/i915/intel_uncore.c | 43 ++++++++++++++++++++++++++++++++++++-
>>   2 files changed, 45 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/gpu/drm/i915/i915_reg.h b/drivers/gpu/drm/i915/i915_reg.h
>> index 0b979ad..3684f92 100644
>> --- a/drivers/gpu/drm/i915/i915_reg.h
>> +++ b/drivers/gpu/drm/i915/i915_reg.h
>> @@ -1461,6 +1461,9 @@ enum skl_disp_power_wells {
>>   #define RING_MAX_IDLE(base)	((base)+0x54)
>>   #define RING_HWS_PGA(base)	((base)+0x80)
>>   #define RING_HWS_PGA_GEN6(base)	((base)+0x2080)
>> +#define RING_RESET_CTL(base)	((base)+0xd0)
>> +#define   RESET_CTL_REQUEST_RESET  (1 << 0)
>> +#define   RESET_CTL_READY_TO_RESET (1 << 1)
>>
>>   #define HSW_GTT_CACHE_EN	0x4024
>>   #define   GTT_CACHE_EN_ALL	0xF0007FFF
>> diff --git a/drivers/gpu/drm/i915/intel_uncore.c b/drivers/gpu/drm/i915/intel_uncore.c
>> index 4a86cf0..160a47a 100644
>> --- a/drivers/gpu/drm/i915/intel_uncore.c
>> +++ b/drivers/gpu/drm/i915/intel_uncore.c
>> @@ -1455,9 +1455,50 @@ static int gen6_do_reset(struct drm_device *dev)
>>   	return ret;
>>   }
>>
>> +static int wait_for_register(struct drm_i915_private *dev_priv,
>> +			     const u32 reg,
>> +			     const u32 mask,
>> +			     const u32 value,
>> +			     const unsigned long timeout_ms)
>> +{
>> +	return wait_for((I915_READ(reg) & mask) == value, timeout_ms);
>> +}
>> +
>> +static int gen8_do_reset(struct drm_device *dev)
>> +{
>> +	struct drm_i915_private *dev_priv = dev->dev_private;
>> +	struct intel_engine_cs *engine;
>> +	int i;
>> +
>> +	for_each_ring(engine, dev_priv, i) {
>> +		I915_WRITE(RING_RESET_CTL(engine->mmio_base),
>> +			   _MASKED_BIT_ENABLE(RESET_CTL_REQUEST_RESET));
>> +
>> +		if (wait_for_register(dev_priv,
>> +				      RING_RESET_CTL(engine->mmio_base),
>> +				      RESET_CTL_READY_TO_RESET,
>> +				      RESET_CTL_READY_TO_RESET,
>> +				      700)) {
>> +			DRM_ERROR("%s: reset request timeout\n", engine->name);
>> +			goto not_ready;
>> +		}
>
> So just to be clear here: If one or more of the reset control registers 
> decide that they are at a point where they will never again be ready for 
> reset we will simply not do a full GPU reset until reboot? 

Correct. At least for now, until we find out what upsets the engine
so much that resetting it hangs the system. So for now it is just
a choice between a dead gpu and a dead system.

> Is there
> perhaps a case where you would want to try reset request once or twice 
> or like five times or whatever but then simply go ahead with the full 
> GPU reset regardless of what the reset control register tells you? After 
> all, it's our only way out if the hardware is truly stuck.
>

That would be best if we could count on the reset only resetting
the GPU. Then we would only risk losing or corrupting the context
(and only with per-ring resets).

But until we learn more about this situation, we risk hanging the
whole system by trying to revive the gpu. I tried to update
the commit message to reflect this.
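
For illustration only, the request/ready handshake and the rollback on
timeout can be sketched in plain userspace C. This is not driver code:
struct fake_engine, masked_write(), read_reset_ctl(), wait_ready() and
request_reset_all() are hypothetical stand-ins, the "hardware" is
simulated, and the timeout is counted in poll iterations rather than
milliseconds. Only the two RESET_CTL bit definitions come from the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define RESET_CTL_REQUEST_RESET  (1u << 0)
#define RESET_CTL_READY_TO_RESET (1u << 1)

/* Masked-register encoding: upper 16 bits select the bits to update. */
#define MASKED_BIT_ENABLE(a)  (((a) << 16) | (a))
#define MASKED_BIT_DISABLE(a) ((a) << 16)

/* Hypothetical stand-in for one engine's reset control register. */
struct fake_engine {
	uint32_t reset_ctl;
	bool stuck;             /* simulated hang: never reports ready */
	int polls_until_ready;  /* simulated latency before ready is set */
};

/* Update only the bits selected by the mask in the upper half. */
static void masked_write(struct fake_engine *e, uint32_t val)
{
	uint32_t mask = val >> 16;

	e->reset_ctl = (e->reset_ctl & ~mask) | (val & mask);
	/* Simulation assumption: withdrawing the request clears ready. */
	if (!(e->reset_ctl & RESET_CTL_REQUEST_RESET))
		e->reset_ctl &= ~RESET_CTL_READY_TO_RESET;
}

/* Read the register; a healthy engine eventually acknowledges. */
static uint32_t read_reset_ctl(struct fake_engine *e)
{
	if ((e->reset_ctl & RESET_CTL_REQUEST_RESET) && !e->stuck) {
		if (e->polls_until_ready > 0)
			e->polls_until_ready--;
		else
			e->reset_ctl |= RESET_CTL_READY_TO_RESET;
	}
	return e->reset_ctl;
}

/* Poll up to max_polls times; 0 when ready, -ETIMEDOUT otherwise. */
static int wait_ready(struct fake_engine *e, int max_polls)
{
	for (int i = 0; i < max_polls; i++)
		if (read_reset_ctl(e) & RESET_CTL_READY_TO_RESET)
			return 0;
	return -ETIMEDOUT;
}

/*
 * Ask every engine to get ready. If any engine fails to acknowledge,
 * withdraw the request on all engines and return -EIO, so the caller
 * never goes on to the full reset.
 */
static int request_reset_all(struct fake_engine *engines, size_t n,
			     int max_polls)
{
	assert(engines != NULL || n == 0);

	for (size_t i = 0; i < n; i++) {
		masked_write(&engines[i],
			     MASKED_BIT_ENABLE(RESET_CTL_REQUEST_RESET));
		if (wait_ready(&engines[i], max_polls))
			goto not_ready;
	}
	return 0; /* all engines ready: the full reset would follow here */

not_ready:
	for (size_t i = 0; i < n; i++)
		masked_write(&engines[i],
			     MASKED_BIT_DISABLE(RESET_CTL_REQUEST_RESET));
	return -EIO;
}
```

In this sketch a single stuck engine is enough to abandon the reset for
all engines, which is the "dead gpu rather than dead system" choice
described above. A bounded retry before forcing the reset, as Tomas
suggests, would be a loop around request_reset_all(), but that variant
is deliberately not shown as it is the risky path.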

-Mika

> Thanks,
> Tomas
>
>> +	}
>> +
>> +	return gen6_do_reset(dev);
>> +
>> +not_ready:
>> +	for_each_ring(engine, dev_priv, i)
>> +		I915_WRITE(RING_RESET_CTL(engine->mmio_base),
>> +			   _MASKED_BIT_DISABLE(RESET_CTL_REQUEST_RESET));
>> +
>> +	return -EIO;
>> +}
>> +
>>   static int (*intel_get_gpu_reset(struct drm_device *dev))(struct drm_device *)
>>   {
>> -	if (INTEL_INFO(dev)->gen >= 6)
>> +	if (INTEL_INFO(dev)->gen >= 8)
>> +		return gen8_do_reset;
>> +	else if (INTEL_INFO(dev)->gen >= 6)
>>   		return gen6_do_reset;
>>   	else if (IS_GEN5(dev))
>>   		return ironlake_do_reset;
>>