All the mail mirrored from lore.kernel.org
From: Dave Gordon <david.s.gordon@intel.com>
To: intel-gfx@lists.freedesktop.org
Subject: Re: [RFC 03/11] drm/i915: Add reset stats entry point for per-engine reset.
Date: Thu, 18 Jun 2015 12:12:00 +0100	[thread overview]
Message-ID: <5582A780.7080506@intel.com> (raw)
In-Reply-To: <20150616135449.GE11933@nuc-i3427.alporthouse.com>

On 16/06/15 14:54, Chris Wilson wrote:
> On Tue, Jun 16, 2015 at 03:48:09PM +0200, Daniel Vetter wrote:
>> On Mon, Jun 08, 2015 at 06:33:59PM +0100, Chris Wilson wrote:
>>> On Mon, Jun 08, 2015 at 06:03:21PM +0100, Tomas Elf wrote:
>>>> In preparation for per-engine reset, add a way for setting context reset stats.
>>>>
>>>> OPEN QUESTIONS:
>>>> 1. How do we deal with get_reset_stats and the GL robustness interface when
>>>> introducing per-engine resets?
>>>>
>>>> 	a. Do we set context that cause per-engine resets as guilty? If so, how
>>>> 	does this affect context banning?
>>>
>>> Yes. If the reset works quicker, then we can set a higher threshold for
>>> DoS detection, but we still do need DoS detection.
>>>  
>>>> 	b. Do we extend the publicly available reset stats to also contain
>>>> 	per-engine reset statistics? If so, would this break the ABI?
>>>
>>> No. The get_reset_stats is targeted at the GL API and describing it in
>>> terms of whether my context is guilty or has been affected. That is
>>> orthogonal to whether the reset was on a single ring or the entire GPU -
>>> the question is how broad do we want the "affected" set to be. Ideally a
>>> per-context reset wouldn't necessarily impact others, except for the
>>> surfaces shared between them...
>>
>> GL computes sharing sets itself; the kernel only tells it whether a given
>> context has been victimized, i.e. one of its batches was not properly
>> executed due to a reset after a hang.
> 
> So you don't think we should delete all pending requests that depend
> upon state from the hung request?
> -Chris

John Harrison & I discussed this yesterday; he's against doing so (even
though the scheduler is ideally placed to do it, if that were actually
the preferred policy). The primary argument (as I see it) is that you
actually don't and can't know the nature of an apparent dependency
between batches that share a buffer object. There are at least three cases:

1. "tightly-coupled": the dependent batch is going to rely on data
produced by the earlier batch. In this case GIGO (garbage in, garbage
out) applies and the results will be undefined, possibly including a
further hang. Subsequent batches presumably belong to the same or a
closely-related (co-operating) task, and killing them might be a
reasonable strategy here.

2. "loosely-coupled": the dependent batch is going to access the data,
but not in any way that depends on the content (for example, blitting a
rectangle into a composition buffer). The result will be wrong, but only
in a limited way (e.g. the window belonging to the faulty application
will appear corrupted). The dependent batches may well belong to unrelated
system tasks (e.g. X or surfaceflinger) and killing them is probably not
justified.

3. "uncoupled": the dependent batch wants the /buffer/, not the data in
it (most likely a framebuffer or similar object). Any incorrect data in
the buffer is irrelevant. Killing off subsequent batches would be wrong.

Buffer access mode (readonly, read/write, writeonly) might allow us to
distinguish these somewhat, but probably not enough to help make the
right decision. So the default must be *not* to kill off dependants
automatically, but if the failure does propagate in such a way as to
cause further consequent hangs, then the context-banning mechanism
should eventually catch and block all the downstream effects.

.Dave.
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx

