From: Daniel Vetter <daniel@ffwll.ch>
To: Chris Wilson <chris@chris-wilson.co.uk>,
	Daniel Vetter <daniel@ffwll.ch>, Tomas Elf <tomas.elf@intel.com>,
	Intel-GFX@Lists.FreeDesktop.Org
Subject: Re: [RFC 03/11] drm/i915: Add reset stats entry point for per-engine reset.
Date: Tue, 16 Jun 2015 17:55:34 +0200
Message-ID: <20150616155534.GZ23637@phenom.ffwll.local>
In-Reply-To: <20150616135449.GE11933@nuc-i3427.alporthouse.com>

On Tue, Jun 16, 2015 at 02:54:49PM +0100, Chris Wilson wrote:
> On Tue, Jun 16, 2015 at 03:48:09PM +0200, Daniel Vetter wrote:
> > On Mon, Jun 08, 2015 at 06:33:59PM +0100, Chris Wilson wrote:
> > > On Mon, Jun 08, 2015 at 06:03:21PM +0100, Tomas Elf wrote:
> > > > In preparation for per-engine reset, add a way to set context reset stats.
> > > > 
> > > > OPEN QUESTIONS:
> > > > 1. How do we deal with get_reset_stats and the GL robustness interface when
> > > > introducing per-engine resets?
> > > > 
> > > > 	a. Do we set contexts that cause per-engine resets as guilty? If so, how
> > > > 	does this affect context banning?
> > > 
> > > Yes. If the reset works quicker, then we can set a higher threshold for
> > > DoS detection, but we still do need DoS detection?
> > >  
> > > > 	b. Do we extend the publicly available reset stats to also contain
> > > > 	per-engine reset statistics? If so, would this break the ABI?
> > > 
> > > No. The get_reset_stats is targeted at the GL API and describing it in
> > > terms of whether my context is guilty or has been affected. That is
> > > orthogonal to whether the reset was on a single ring or the entire GPU -
> > > the question is how broad do we want the "affected" to be. Ideally a
> > > per-context reset wouldn't necessarily impact others, except for the
> > > surfaces shared between them...
> > 
> > gl computes sharing sets itself, the kernel only tells it whether a given
> > context has been victimized, i.e. one of its batches was not properly
> > executed due to reset after a hang.
> 
> So you don't think we should delete all pending requests that depend
> upon state from the hung request?

Tbh I haven't fully thought through what happens with partial resets.
Looking into the future with hardware faulting/svm it's clear that soonish
the kernel won't even be in a position to know dependencies. And userspace
already needs to take any kind of texture sharing into account when
computing certain arb_robustness values.

Given that, I'm leaning towards a lean implementation in the kernel of only
marking the actual victim batches/contexts and simply continuing to
execute everything else. That carries some risk of ending up in continual
resets if a bit of corruption causes all follow-up batches to fail, but
that's something we need to be able to handle (using a full-blown reset
where we throw away all the batches) anyway. And eventually we could even
escalate to refusing gpu access to repeat offenders.
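
Very roughly, that policy could be sketched as below. This is only an
illustration under assumptions: every name, both thresholds and the
pending[] bookkeeping are made up for the example and are not actual i915
code or ABI.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* All names and thresholds below are hypothetical, not taken from i915. */
#define ESCALATE_AFTER 3 /* assumed: back-to-back engine resets before a full reset */
#define BAN_AFTER      2 /* assumed: guilty count at which a context is banned */

enum reset_action { RESET_ENGINE, RESET_FULL_GPU };

struct ctx_stats {
	unsigned int guilty; /* hangs this context caused */
	unsigned int victim; /* batches of this context lost to a reset */
	bool banned;         /* repeat offender, further GPU access refused */
};

struct engine_state {
	unsigned int consecutive_resets; /* resets since the last good batch */
};

/* A batch completed normally, so the engine is considered healthy again. */
static void batch_completed(struct engine_state *e)
{
	e->consecutive_resets = 0;
}

/*
 * Context `hung` caused a hang. pending[i] says whether context i has
 * batches queued. Returns the kind of reset to perform.
 */
static enum reset_action handle_hang(struct engine_state *e,
				     struct ctx_stats *ctx, size_t n,
				     size_t hung, const bool *pending)
{
	assert(hung < n);

	ctx[hung].guilty++;
	if (ctx[hung].guilty >= BAN_AFTER)
		ctx[hung].banned = true;

	if (++e->consecutive_resets < ESCALATE_AFTER) {
		/* Lean path: only the hung batch is lost, the rest keeps running. */
		ctx[hung].victim++;
		return RESET_ENGINE;
	}

	/* Continual resets: throw away all queued batches and start over. */
	for (size_t i = 0; i < n; i++)
		if (i == hung || pending[i])
			ctx[i].victim++;
	e->consecutive_resets = 0;
	return RESET_FULL_GPU;
}
```

The point of the sketch is that the kernel never tries to work out which
other contexts depend on the hung batch: it only counts, and leaves the
sharing-set computation to userspace.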

But it's definitely something we need to decide upon, and something which
needs to be carefully tested with nasty igts for all corner cases. And
preferably also at least some basic multi-context testcases on top of
mesa/libva robustness.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch