Re: [RFC 10/11] drm/i915: Debugfs interface for per-engine hang recovery.

From: Tomas Elf <tomas.elf@intel.com>
To: Chris Wilson <chris@chris-wilson.co.uk>
Cc: "Intel-GFX@Lists.FreeDesktop.Org"
	<Intel-GFX@Lists.FreeDesktop.Org>,
	Ian Lister <ian.lister@intel.com>
Subject: Re: [RFC 10/11] drm/i915: Debugfs interface for per-engine hang recovery.
Date: Tue, 09 Jun 2015 12:18:28 +0100
Message-ID: <5576CB84.7020204@intel.com>
In-Reply-To: <20150608174555.GH11457@nuc-i3427.alporthouse.com>

On 08/06/2015 18:45, Chris Wilson wrote:
> On Mon, Jun 08, 2015 at 06:03:28PM +0100, Tomas Elf wrote:
>> 1. The i915_wedged_set function allows us to schedule three forms of hang recovery:
>>
>> 	a) Legacy hang recovery: By passing e.g. -1 we trigger the legacy full
>> 	GPU reset recovery path.
>>
>> 	b) Single engine hang recovery: By passing an engine ID in the interval
>> 	of [0, I915_NUM_RINGS) we can schedule hang recovery of any single
>> 	engine assuming that the context submission consistency requirements
>> 	are met (otherwise the hang recovery path will simply exit early and
>> 	wait for another hang detection). The values are assumed to use up bits
>> 	3:0 only since we certainly do not support as many as 16 engines.
>>
>> 	This mode is supported since there are several legacy test applications
>> 	that rely on this interface.
>
> Are there? I don't see them in igt - and let's not start making debugfs
> ABI.

They're not in IGT, only internal to VPG. I guess we could limit these 
changes and adapt the internal test suite in VPG instead of upstreaming 
changes that only VPG validation cares about.

>
>> 	c) Multiple engine hang recovery: By passing in an engine flag mask in
>> 	bits 31:8 (bit 8 corresponds to engine 0 = RCS, bit 9 corresponds to
>> 	engine 1 = VCS etc) we can schedule any combination of engine hang
>> 	recoveries as we please. For example, by passing in the value 0x3 << 8
>> 	we would schedule hang recovery for engines 0 and 1 (RCS and VCS) at
>> 	the same time.
>
> Seems fine. But I don't see the reason for the extra complication.

I wanted to make sure that we could test multiple concurrent hang 
recoveries, but to be fair nobody is actually using this at the moment 
so unless someone actually _needs_ this we probably don't need to 
upstream it.

I guess we could leave it in its currently upstreamed form where it only 
allows full GPU reset. Or would it be of use to anyone to support 
per-engine recovery?
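
To make the encoding under discussion concrete, here is a minimal 
standalone sketch of how a value written to i915_wedged might be decoded 
under the rules in the commit message. This is not the code from the 
patch. The names (decode_wedged_value, struct recovery_request, 
NUM_RINGS = 5) are hypothetical, and two points are my assumptions since 
the description leaves them open: only an all-ones 32-bit value selects 
the legacy full reset, and the engine mask is used only when bits 3:0 
are zero, so a plain 0 still means "engine 0".

```c
#include <stdint.h>

#define NUM_RINGS         5     /* assumption: stands in for I915_NUM_RINGS */
#define ENGINE_ID_MASK    0xfu  /* bits 3:0 hold a single engine ID */
#define ENGINE_MASK_SHIFT 8     /* bits 31:8 hold the engine flag mask */

enum recovery_mode {
	RECOVER_NONE,      /* nothing valid requested */
	RECOVER_FULL_GPU,  /* legacy full GPU reset */
	RECOVER_ENGINES,   /* per-engine recovery, see engine_mask */
};

struct recovery_request {
	enum recovery_mode mode;
	uint32_t engine_mask;  /* bit n set means recover engine n */
};

static struct recovery_request decode_wedged_value(uint64_t val)
{
	struct recovery_request req = { RECOVER_NONE, 0 };
	uint32_t v = (uint32_t)val;
	uint32_t id = v & ENGINE_ID_MASK;
	uint32_t mask = v >> ENGINE_MASK_SHIFT;

	/* Assumption: -1 (all ones) is the only legacy full-reset value. */
	if (v == UINT32_MAX) {
		req.mode = RECOVER_FULL_GPU;
		return req;
	}

	/* Multi-engine mode: only when the ID field is unused (zero). */
	if (id == 0 && mask != 0) {
		mask &= (1u << NUM_RINGS) - 1;  /* drop nonexistent engines */
		if (mask) {
			req.mode = RECOVER_ENGINES;
			req.engine_mask = mask;
		}
		return req;
	}

	/* Single-engine mode takes precedence; bits 31:8 are ignored. */
	if (id < NUM_RINGS) {
		req.mode = RECOVER_ENGINES;
		req.engine_mask = 1u << id;
	}
	return req;
}
```

With this sketch, 0x3 << 8 decodes to engines 0 and 1, while 
(0x3 << 8) | 2 decodes to engine 2 only, which matches the precedence 
rule quoted below.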

>
>> 	If bits in fields 3:0 and 31:8 are both used then single engine hang
>> 	recovery mode takes precedence and bits 31:8 are ignored.
>>
>> 2. The i915_wedged_get function produces a set of statistics related to:
>
> Add it to hangcheck_info instead.

Yeah, I considered that but I felt that hangcheck_info had too much text 
and it would be too much of a hassle to parse out the data. But having 
spoken to the validation guys it seems like they're fine with updating 
the parser. So I could update hangcheck_info with this new information.

>
> i915_wedged_get could be updated to give the ring mask of wedged rings?
> If that concept exists.
> -Chris
>

Nah, no need, I'll just add the information to hangcheck_info. Besides, 
wedged_get needs to provide more information than just the current 
wedged state. It also provides information about the number of resets, 
the number of watchdog timeouts etc. So it's not that easy to summarise 
it as a ring mask.

Thanks,
Tomas