All the mail mirrored from lore.kernel.org
 help / color / mirror / Atom feed
From: Chris Wilson <chris@chris-wilson.co.uk>
To: Tomas Elf <tomas.elf@intel.com>
Cc: Intel-GFX@Lists.FreeDesktop.Org
Subject: Re: [RFC 00/11] TDR/watchdog timeout support for gen8
Date: Sat, 11 Jul 2015 19:22:28 +0100	[thread overview]
Message-ID: <20150711182228.GE21656@nuc-i3427.alporthouse.com> (raw)
In-Reply-To: <559FE3B1.6090902@intel.com>

On Fri, Jul 10, 2015 at 04:24:33PM +0100, Tomas Elf wrote:
> On 09/07/2015 19:47, Chris Wilson wrote:
> >On Mon, Jun 08, 2015 at 06:03:18PM +0100, Tomas Elf wrote:
> >>This patch series introduces the following features:
> >>
> >>* Feature 1: TDR (Timeout Detection and Recovery) for gen8 execlist mode.
> >>* Feature 2: Watchdog Timeout (a.k.a "media engine reset") for gen8.
> >>* Feature 3. Context Submission Status Consistency checking
> >
> >The high level design is reasonable and conceptually extends the current
> >system in fairly obvious ways.
> >
> >In terms of discussing the implementation, I think this can be phased
> >as:
> >
> >0. Move to a per-engine hangcheck
> >
> >1. Add fake-interrupt recovery for CSSC
> >
> >   I think this can be done without changing the hangcheck heuristics at
> >   all - we just need to think carefully about recovery (which is a nice
> >   precursor to per-engine reset). I may be wrong, and so would like to
> >   be told so early on! If the fake interrupt recovery is done as part of
> >   the reset handler, we should have (one) fewer concurrency issues to
> >   worry about.
> 
> Some points about moving the CSSC out of the hang checker and into
> the reset handler:
> 
> 1. If we deal with consistency rectification in the reset handler
> the turnaround time becomes REALLY long:
> 		
> 	a. First you have the time to detect the hang, call
> i915_handle_error() that then raises the reset in progress flag,
> preventing further submissions to the driver.
> 
> 	b. Then go all the way to the per-engine recovery path, only to
> discover that we've got an inconsistency that has not been handled,
> fall back immediately with -EAGAIN and lower the reset in progress
> flag, let the system continue running and defer to the next hang
> check (another hang detection period)

Urm, there is no per-engine recovery at this phase. And even if there were,
why would you do that if you could detect the inconsistency and just
issue the fake interrupt? This is why I think phasing it in this way
makes it more obvious about the choices we can make during reset
handling.

[snip]

> 2. Secondly, and more importantly, if a watchdog timeout is detected
> and we end up in the per-engine hang recovery path and have to fall
> back due to an inconsistent context submission state at that point
> and the hang checker is turned off then we're irrecoverably hung.
> Watchdog timeout is supposed to work without the periodic hang
> checker but it won't if CSSC is not ensured at all times. Which is
> why I chose to override the i915.enable_hangcheck flag to make sure
> that the hang checker always runs consistency pre-checking and
> reschedules itself if there is more work pending to make sure that
> as long as work is pending we do consistency checking asynchronously
> regardless of everything else so that if a watchdog timeout hits we
> have a consistent state once the watchdog timeout ends up in
> per-engine recovery.

Again, what? If the watchdog mechanism is just a faster way to detect
hangs and queue the reset task, what the reset does is then independent
of how we detect the hang.
 
> Granted, if a watchdog timeout hits after we've first detected the
> inconsistency but not yet had time to rectify it it doesn't work if
> the hang checker is turned off and we cannot rely on periodic hang
> checking to schedule hang recovery in this case - so in that case
> we're still irrecoverably stuck. We could make change here and do a
> one-time i915.enable_hangcheck override and schedule hang recovery
> following this point. If you think it's worth it.
> 	
> Bottom line: The consistency checking must happen at all times and
> cannot be done as a consequence of a scheduled reset if hang
> checking is turned off at any point.

I strongly disagree. I think of this as two unique tasks, (a) detect
hang and (b) do something about it. At the moment (b) isn't smart
enough to work out the minimum it has to do in order to recover.

Which is why I suggest you think about adding the simpler fake interrupt
into the reset path first, before you even think about doing TDR.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx

      parent reply	other threads:[~2015-07-11 18:22 UTC|newest]

Thread overview: 59+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2015-06-08 17:03 [RFC 00/11] TDR/watchdog timeout support for gen8 Tomas Elf
2015-06-08 17:03 ` [RFC 01/11] drm/i915: Early exit from semaphore_waits_for for execlist mode Tomas Elf
2015-06-08 17:36   ` Chris Wilson
2015-06-09 11:02     ` Tomas Elf
2015-06-16 13:44   ` Daniel Vetter
2015-06-16 15:46     ` Tomas Elf
2015-06-16 16:50       ` Chris Wilson
2015-06-16 17:07         ` Tomas Elf
2015-06-17 11:43       ` Daniel Vetter
2015-06-08 17:03 ` [RFC 02/11] drm/i915: Introduce uevent for full GPU reset Tomas Elf
2015-06-16 13:43   ` Daniel Vetter
2015-06-16 15:43     ` Tomas Elf
2015-06-16 16:55       ` Chris Wilson
2015-06-16 17:32         ` Tomas Elf
2015-06-16 19:33           ` Chris Wilson
2015-06-17 11:49             ` Daniel Vetter
2015-06-17 12:51               ` Chris Wilson
2015-06-08 17:03 ` [RFC 03/11] drm/i915: Add reset stats entry point for per-engine reset Tomas Elf
2015-06-08 17:33   ` Chris Wilson
2015-06-09 11:06     ` Tomas Elf
2015-06-16 13:48     ` Daniel Vetter
2015-06-16 13:54       ` Chris Wilson
2015-06-16 15:55         ` Daniel Vetter
2015-06-18 11:12         ` Dave Gordon
2015-06-11  9:14   ` Dave Gordon
2015-06-16 13:49   ` Daniel Vetter
2015-06-16 15:54     ` Tomas Elf
2015-06-17 11:51       ` Daniel Vetter
2015-06-08 17:03 ` [RFC 04/11] drm/i915: Adding TDR / per-engine reset support for gen8 Tomas Elf
2015-06-08 17:03 ` [RFC 05/11] drm/i915: Extending i915_gem_check_wedge to check engine reset in progress Tomas Elf
2015-06-08 17:24   ` Chris Wilson
2015-06-09 11:08     ` Tomas Elf
2015-06-09 11:11   ` Chris Wilson
2015-06-08 17:03 ` [RFC 06/11] drm/i915: Disable warnings for TDR interruptions in the display driver Tomas Elf
2015-06-08 17:53   ` Chris Wilson
2015-06-08 17:03 ` [RFC 07/11] drm/i915: Reinstate hang recovery work queue Tomas Elf
2015-06-08 17:03 ` [RFC 08/11] drm/i915: Watchdog timeout support for gen8 Tomas Elf
2015-06-08 17:03 ` [RFC 09/11] drm/i915: Fake lost context interrupts through forced CSB check Tomas Elf
2015-06-08 17:03 ` [RFC 10/11] drm/i915: Debugfs interface for per-engine hang recovery Tomas Elf
2015-06-08 17:45   ` Chris Wilson
2015-06-09 11:18     ` Tomas Elf
2015-06-09 12:27       ` Chris Wilson
2015-06-09 17:28         ` Tomas Elf
2015-06-11  9:32     ` Dave Gordon
2015-06-08 17:03 ` [RFC 11/11] drm/i915: TDR/watchdog trace points Tomas Elf
2015-06-23 10:05 ` [RFC 00/11] TDR/watchdog timeout support for gen8 Daniel Vetter
2015-06-23 10:47   ` Tomas Elf
2015-06-23 11:38     ` Daniel Vetter
2015-06-23 14:06       ` Tomas Elf
2015-06-23 15:20         ` Daniel Vetter
2015-06-23 15:35           ` Daniel Vetter
2015-06-25 10:38             ` Tomas Elf
2015-07-03 11:15 ` Mika Kuoppala
2015-07-03 17:41   ` Tomas Elf
2015-07-09 18:47 ` Chris Wilson
2015-07-10 15:24   ` Tomas Elf
2015-07-10 15:48     ` Tomas Elf
2015-07-11 18:15       ` Chris Wilson
2015-07-11 18:22     ` Chris Wilson [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20150711182228.GE21656@nuc-i3427.alporthouse.com \
    --to=chris@chris-wilson.co.uk \
    --cc=Intel-GFX@Lists.FreeDesktop.Org \
    --cc=tomas.elf@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.