All the mail mirrored from lore.kernel.org
 help / color / mirror / Atom feed
From: Chris Wilson <chris@chris-wilson.co.uk>
To: Tomas Elf <tomas.elf@intel.com>
Cc: Intel-GFX@Lists.FreeDesktop.Org
Subject: Re: [RFC 00/11] TDR/watchdog timeout support for gen8
Date: Sat, 11 Jul 2015 19:15:43 +0100	[thread overview]
Message-ID: <20150711181543.GD21656@nuc-i3427.alporthouse.com> (raw)
In-Reply-To: <559FE965.8070807@intel.com>

On Fri, Jul 10, 2015 at 04:48:53PM +0100, Tomas Elf wrote:
> On 10/07/2015 16:24, Tomas Elf wrote:
> >On 09/07/2015 19:47, Chris Wilson wrote:
> >>On Mon, Jun 08, 2015 at 06:03:18PM +0100, Tomas Elf wrote:
> >>>This patch series introduces the following features:
> >>>
> >>>* Feature 1: TDR (Timeout Detection and Recovery) for gen8 execlist
> >>>mode.
> >>>* Feature 2: Watchdog Timeout (a.k.a "media engine reset") for gen8.
> >>>* Feature 3. Context Submission Status Consistency checking
> >>
> >>The high level design is reasonable and conceptually extends the current
> >>system in fairly obvious ways.
> >>
> >>In terms of discussing the implementation, I think this can be phased
> >>as:
> >>
> >>0. Move to a per-engine hangcheck
> >>
> >>1. Add fake-interrupt recovery for CSSC
> >>
> >>   I think this can be done without changing the hangcheck heuristics at
> >>   all - we just need to think carefully about recovery (which is a nice
> >>   precursor to per-engine reset). I may be wrong, and so would like to
> >>   be told so early on! If the fake interrupt recovery is done as part of
> >>   the reset handler, we should have (one) fewer concurrency issues to
> >>   worry about.
> >
> >Some points about moving the CSSC out of the hang checker and into the
> >reset handler:
> >
> >1. If we deal with consistency rectification in the reset handler the
> >turnaround time becomes REALLY long:
> >
> >     a. First you have the time to detect the hang, call
> >i915_handle_error() that then raises the reset in progress flag,
> >preventing further submissions to the driver.
> >
> >     b. Then go all the way to the per-engine recovery path, only to
> >discover that we've got an inconsistency that has not been handled, fall
> >back immediately with -EAGAIN and lower the reset in progress flag, let
> >the system continue running and defer to the next hang check (another
> >hang detection period)
> >
> >     c. Once the hang has been detected AGAIN, raise the reset in
> >progress flag AGAIN and go back to the engine reset path a second time.
> >
> >     d. At the start of the engine reset path we do the second CSSC
> >detection and realise that we've got a stable inconsistency that we can
> >attempt to rectify. We can then try to rectify the inconsistency and go
> >through with the engine reset... AFTER we've checked that the
> >inconsistency rectification was indeed effective! If it's not and the
> >problem remains then we have to fail the engine recovery mode and fall
> >back to full GPU reset immediately... Which we could have done from the
> >hang checker if we had just refused to schedule hang recovery and just
> >let the context submission state inconsistency persist and let the hang
> >score keep rising until the hang score reached 2*HUNG, which would then
> >have triggered the full GPU reset fallback from the hang checker (see
> >path 9/11 for all of this)
> >
> >As you can see, dealing with context submission state inconsistency in
> >the reset path is very long-winded way of doing it and does not make it
> >more reliable. Also, it's more complicated to analyse from a concurrency
> >point of view since we need to fall back several times and raise and
> >lower the reset in progress flag, which allows driver submissions to
> >happen vs. blocks submissions. It basically becomes very difficult to
> >know what is going on.
> >
> >2. Secondly, and more importantly, if a watchdog timeout is detected and
> >we end up in the per-engine hang recovery path and have to fall back due
> >to an inconsistent context submission state at that point and the hang
> >checker is turned off then we're irrecoverably hung. Watchdog timeout is
> >supposed to work without the periodic hang checker but it won't if CSSC
> >is not ensured at all times. Which is why I chose to override the
> >i915.enable_hangcheck flag to make sure that the hang checker always
> >runs consistency pre-checking and reschedules itself if there is more
> >work pending to make sure that as long as work is pending we do
> >consistency checking asynchronously regardless of everything else so
> >that if a watchdog timeout hits we have a consistent state once the
> >watchdog timeout ends up in per-engine recovery.
> >
> >Granted, if a watchdog timeout hits after we've first detected the
> >inconsistency but not yet had time to rectify it it doesn't work if the
> >hang checker is turned off and we cannot rely on periodic hang checking
> >to schedule hang recovery in this case - so in that case we're still
> >irrecoverably stuck. We could make change here and do a one-time
> >i915.enable_hangcheck override and schedule hang recovery following this
> >point. If you think it's worth it.
> >
> >Bottom line: The consistency checking must happen at all times and
> >cannot be done as a consequence of a scheduled reset if hang checking is
> >turned off at any point.
> >
> >As far as concurrency issues in the face of CSSC is concerned,
> >disregarding the complication of handling CSSC in the recovery path and
> >relying on deferring to the next hang detection    with all of the
> >concurrency issues that entails: The question really is what kind of
> >concurrency issues we're worried about. If the hang checker determines
> >that we've got a hang then that's a stable state. If the hang checker
> >consistency pre-check determines that we've got a sustained CSSC
> >inconsistency then that's stable too. The states are not changing so
> >whatever we do will not be because we detect the state in the middle of
> >a state transition and the detection won't be subject to concurrency
> >effects. If the hang checker decides that the inconsistency needs to be
> >rectified and fakes the presumably lost interrupt and the real, presumed
> >lost, interrupt happens to come in at the same time then that's fine,
> >the CSB buffer check in the execlist interrupt handler is made to cope
> >with that. We can have X number of calls to the interrupt handler or
> >just one, the outcome is supposed to be the same - the only thing that
> >matters is captured context state changes in the CSB buffer that we act
> >upon.
> >
> >So I'm not entirely sure what concurrency issues might be reason enough
> >to move out the CSSC to the hang recovery path. In fact, I'd be more
> >inclined to create a second async task for it to make sure it's being
> >run at all times. But in that case we might as well let it stay in the
> >hang checker.
> >
> >(In a similar vein, I think we should move the missed
> >>   interupt handler for legacy out of hangcheck, partly to simplify some
> >>   very confusing code and partly so that we have fewer deviations
> >>   between legacy/execlists paths.) It also gets us thinking about the
> >>   consistency detection and when it is viable to do a fake-interrupt and
> >>   when we must do a full-reset (for example, we only want to
> >>   fake-interrupt if the consistency check says the GPU is idle, and we
> >>   definitely want to reset everything if the GPU is executing an alien
> >>   context.)
> >>
> >>   A test here would be to suspend the execlists irq and wait for the
> >>   recovery. Cheekily we could punch the irq eir by root mmio and check
> >>   hangcheck automagically recovers.
> >>
> >>Whilst it would be nice to add the watchdog next, since it is
> >>conceptually quite simple and basically just a new hangcheck source with
> >>fancy setup - fast hangcheck without soft reset makes for an easy DoS.
> >>
> >>2. TDR
> >>
> >>   Given that we have a consistency check and begun to extend the reset
> >>   path, we can implement a soft reset that only skips the hung request.
> >>   (The devil is in the details, whilst the design here looked solid, I
> >>   think the LRC recovery code could be simplified - I didn't feel
> >>   another requeue path was required given that we need only pretend a
> >>   request completed (fixing up the context image as required) and then
> >>   use the normal unqueue.) There is also quite a bit of LRC cleanup on
> >>   the lists which would be useful here.
> >
> >As far as the new requeue (or resubmission) path is concerned, you might
> >have a point here. The reason it's as involved as it is is probably
> >mostly because of all the validation that takes place in the
> >resubmission path. Meaning that once the resubmission happens at the end
> >of the per-engine hang recovery path we want to make extra sure that the
> >context that gets resubmitted in the end (the head element of the queue
> >at that point in time) is in fact the one that was passed down from the
> >per-engine hang recovery path (the context at the head of the queue at
> >the start of the hang recovery path), so that the state of the queue
> >didn't change during hang recovery. Maybe we're too paranoid here.
> >
> 
> Ah, yes, there is one crucial difference between the normal
> execlists_context_unqueue() function and
> execlists_TDR_context_unqueue() that means that we cannot fully
> reuse the former one for TDR-specific purposes.
> 
> When we resubmit the context via TDR it's important that we do not
> increment the elsp_submitted counter and otherwise treat the
> resubmission as a normal submission. The reason for this is that the
> hardware consistently refuses to send out any kind of interrupt
> acknowledging the context resubmission. If you just submit the
> context and increment elsp_submitted for that context, like you
> normally do, the interrupt handler will sit around forever waiting
> for the interrupt that will never come for the resubmission. It will
> wait - and receive - the interrupt for the original submission that
> caused the original hang and also the interrupt for the context
> completion following hang recovery. But it won't receive an
> interrupt for the resubmission.

But that's just an issue with the current poor design of execlists...

If we tracked exactly what request we submitted to each port, and a
unique ctx id tag for every submission (incl subsumed ones for easier
completion handling) you completely eliminate the need for elsp_submitted.
And a TDR resubmit is just that. (And also the code is much smaller and
much simpler.)

So I don't think you have sufficiently reworked execlists to avoid
unnecessary duplication.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx

  reply	other threads:[~2015-07-11 18:15 UTC|newest]

Thread overview: 59+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2015-06-08 17:03 [RFC 00/11] TDR/watchdog timeout support for gen8 Tomas Elf
2015-06-08 17:03 ` [RFC 01/11] drm/i915: Early exit from semaphore_waits_for for execlist mode Tomas Elf
2015-06-08 17:36   ` Chris Wilson
2015-06-09 11:02     ` Tomas Elf
2015-06-16 13:44   ` Daniel Vetter
2015-06-16 15:46     ` Tomas Elf
2015-06-16 16:50       ` Chris Wilson
2015-06-16 17:07         ` Tomas Elf
2015-06-17 11:43       ` Daniel Vetter
2015-06-08 17:03 ` [RFC 02/11] drm/i915: Introduce uevent for full GPU reset Tomas Elf
2015-06-16 13:43   ` Daniel Vetter
2015-06-16 15:43     ` Tomas Elf
2015-06-16 16:55       ` Chris Wilson
2015-06-16 17:32         ` Tomas Elf
2015-06-16 19:33           ` Chris Wilson
2015-06-17 11:49             ` Daniel Vetter
2015-06-17 12:51               ` Chris Wilson
2015-06-08 17:03 ` [RFC 03/11] drm/i915: Add reset stats entry point for per-engine reset Tomas Elf
2015-06-08 17:33   ` Chris Wilson
2015-06-09 11:06     ` Tomas Elf
2015-06-16 13:48     ` Daniel Vetter
2015-06-16 13:54       ` Chris Wilson
2015-06-16 15:55         ` Daniel Vetter
2015-06-18 11:12         ` Dave Gordon
2015-06-11  9:14   ` Dave Gordon
2015-06-16 13:49   ` Daniel Vetter
2015-06-16 15:54     ` Tomas Elf
2015-06-17 11:51       ` Daniel Vetter
2015-06-08 17:03 ` [RFC 04/11] drm/i915: Adding TDR / per-engine reset support for gen8 Tomas Elf
2015-06-08 17:03 ` [RFC 05/11] drm/i915: Extending i915_gem_check_wedge to check engine reset in progress Tomas Elf
2015-06-08 17:24   ` Chris Wilson
2015-06-09 11:08     ` Tomas Elf
2015-06-09 11:11   ` Chris Wilson
2015-06-08 17:03 ` [RFC 06/11] drm/i915: Disable warnings for TDR interruptions in the display driver Tomas Elf
2015-06-08 17:53   ` Chris Wilson
2015-06-08 17:03 ` [RFC 07/11] drm/i915: Reinstate hang recovery work queue Tomas Elf
2015-06-08 17:03 ` [RFC 08/11] drm/i915: Watchdog timeout support for gen8 Tomas Elf
2015-06-08 17:03 ` [RFC 09/11] drm/i915: Fake lost context interrupts through forced CSB check Tomas Elf
2015-06-08 17:03 ` [RFC 10/11] drm/i915: Debugfs interface for per-engine hang recovery Tomas Elf
2015-06-08 17:45   ` Chris Wilson
2015-06-09 11:18     ` Tomas Elf
2015-06-09 12:27       ` Chris Wilson
2015-06-09 17:28         ` Tomas Elf
2015-06-11  9:32     ` Dave Gordon
2015-06-08 17:03 ` [RFC 11/11] drm/i915: TDR/watchdog trace points Tomas Elf
2015-06-23 10:05 ` [RFC 00/11] TDR/watchdog timeout support for gen8 Daniel Vetter
2015-06-23 10:47   ` Tomas Elf
2015-06-23 11:38     ` Daniel Vetter
2015-06-23 14:06       ` Tomas Elf
2015-06-23 15:20         ` Daniel Vetter
2015-06-23 15:35           ` Daniel Vetter
2015-06-25 10:38             ` Tomas Elf
2015-07-03 11:15 ` Mika Kuoppala
2015-07-03 17:41   ` Tomas Elf
2015-07-09 18:47 ` Chris Wilson
2015-07-10 15:24   ` Tomas Elf
2015-07-10 15:48     ` Tomas Elf
2015-07-11 18:15       ` Chris Wilson [this message]
2015-07-11 18:22     ` Chris Wilson

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20150711181543.GD21656@nuc-i3427.alporthouse.com \
    --to=chris@chris-wilson.co.uk \
    --cc=Intel-GFX@Lists.FreeDesktop.Org \
    --cc=tomas.elf@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.