All the mail mirrored from lore.kernel.org
 help / color / mirror / Atom feed
From: Eric Dumazet <eric.dumazet@gmail.com>
To: Thomas Gleixner <tglx@linutronix.de>
Cc: LKML <linux-kernel@vger.kernel.org>,
	Ingo Molnar <mingo@kernel.org>,
	Peter Zijlstra <peterz@infradead.org>,
	Paul McKenney <paulmck@linux.vnet.ibm.com>,
	Frederic Weisbecker <fweisbec@gmail.com>,
	Eric Dumazet <edumazet@google.com>,
	Viresh Kumar <viresh.kumar@linaro.org>,
	John Stultz <john.stultz@linaro.org>,
	Joonwoo Park <joonwoop@codeaurora.org>,
	Wenbo Wang <wenbo.wang@memblaze.com>
Subject: Re: [patch 0/7] timers: Footprint diet and NOHZ overhead mitigation
Date: Wed, 27 May 2015 07:53:31 -0700	[thread overview]
Message-ID: <1432738411.4060.389.camel@edumazet-glaptop2.roam.corp.google.com> (raw)
In-Reply-To: <20150526210723.245729529@linutronix.de>

On Tue, 2015-05-26 at 22:50 +0000, Thomas Gleixner wrote:
> I still have a couple of patches against the timer code in my review
> list, but the more I look at them the more horrible I find them.
> 
> All these patches are related to the NOHZ stuff and try to mitigate
> shortcomings with even more bandaids. And of course these bandaids
> cost cycles and are a burden for timer heavy use cases like
> networking.
> 
> Sadly enough the NOHZ crowd is happy to throw more and more crap at
> the kernel and none of these people is even thinking about whether
> this stuff could be solved in a different way.
> 
> After Eric mentioned in one of the recent discussions that the
> timer_migration sysctl is not a lot of gain, I tried to mitigate at
> least that issue. That caused me to look deeper and I came up with the
> following patch series which:
> 
>   - Clean up the sloppy catchup timer jiffies stuff
> 
>   - Replace the hash bucket lists by hlists, which shrinks the wheel
>     footprint by 50%
> 
>   - Replaces the timer base pointer in struct timer_list by a cpu
>     index, which shrinks struct timer_list by 4 - 8 bytes depending on
>     alignement and architecture.
> 
>   - Caches nohz_active and timer_migration in the timer per cpu bases,
>     so we can avoid going through loops and hoops to figure that out.
> 
> This series applies against
> 
>     git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git timer/core
> 
> With a pretty naive timer test module and a sched other cpu hog on an
> isolated core I verified that I did not wreckage anything. The
> following table shows the resulting CPU time of the hog for the
> various scenarios.
> 
> 	     nohz=on	      	nohz=on			nohz=off
> 	     timer_migration=on	timer_migration=off	    
> 
> Unpatched:   46.57%		46.87%			46.64%
> 
> Patched:     47.76%		48.20%			48.73%
> 
> Though, I do not trust my numbers, so I really hope that Eric or any
> other power timer wheel user can throw a bunch of tests at it.
> 
> 
> Now some more rant about the whole NOHZ issues. The timer wheel is
> fundamentally unfriendly for NOHZ because:
> 
>   - it's impossible to keep reliably track of the next expiring timer
> 
>   - finding the next expiring timer is in the worst case O(N) and
>     involves touching a gazillion of cache lines
> 
> The other disaster area is the nohz timer migration mechanism. I think
> it's fundamentally broken. 
> 
>  - We decide at timer enqueue time, on which CPU we queue the timer,
>    based on cpu idle heuristics which even fails to recognize that a
>    cpu is really busy with interrupt and softirq processing (reported
>    by Eric).
> 
>  - When moving a timer to some "non-idle" CPU we speculate about the
>    system situation in the future (at timer expiry time). This kinda
>    works for limited scenarios (NOHZ_FULL and constraint power saving
>    on mobile devices). But it is pretty much nonsensical for
>    everything else. For network heavy workloads it can be even
>    outright wrong and dangerous as Eric explained.
> 
> So we really need to put a full stop at all this NOHZ tinkering and
> think about proper solutions. I'm not going to merge any NOHZ related
> features^Whacks before the existing problems are not solved. In
> hindsight I should have done that earlier, but it's never too late.
> 
> So we need to address two issues:
> 
> 1) Keeping track of the first expiring timer in the wheel.
> 
>    Don't even think about rbtrees or such, it just wont work, but I'm
>    willing to accept prove of the contrary.
> 
>    One of the ideas I had a long time ago is to have a wheel
>    implementation, which does never cascade and therefor provides
>    different levels of granularity per wheel level.
> 
>    LVL0	   1  jiffy granularity
>    LVL1	   8  jiffies granularity
>    LVL1	   64 jiffies granularity
>    ....
> 
>    This requires more levels than the classic timer wheel, so its not
>    feasible from a memory consumption POV.
> 
>    But we can have a compromise and put all 'soon' expiring timers
>    into these fancy new wheels and stick all 'far out' timers into the
>    last level of the wheel and cascade them occasionally.
> 
>    That should work because:
> 
>      - Most timers are short term expiry (< 1h)
>      - Most timers are canceled before expiry
> 
>    So we need a sensible upper bound of levels and get the following
>    properties:
> 
>    	- Natural batching (no more slack magic). This might also help
>           networking to avoid rearming of timers.
> 
> 	- Long out timers are queued in the slowest wheel and
>           ocassionally with the granularity of the slowest wheel
>           cascaded
> 
> 	- Tracking the next expiring timer can be done with a bitmap,
>           so the 'get next expiring timer' becomes O(1) without
>           touching anything else than the bitmap, if we accept that
>           the upper limit of what we can retrieve O(1) is the
>           granularity of the last level, i.e. we treat timers which
>           need recascading simple as normal inhabitants of the last
>           wheel level.
> 	  
>         - The expiry code (run_timers) gets more complicated as it
>           needs to walk through the different levels more often, but
>           with the bitmap that shouldnt be a too big issue.
> 
>         - Seperate storage for non-deferrable and deferrable timers.
> 
>     I spent some time in coding that up. It barely boots and has
>     certainly a few bugs left and right, but I will send it out
>     nevertheless as a reply to this mail to get the discussion going.

Thanks a lot Thomas !

I finally got back two hosts with 72 cpus where I can try your patches.

I have one comment about socket TCP timers : They are re-armed very
often, but they are never canceled (unless the socket is dismantled)

I am too lazy to search for the commit adding this stuff (before git),
but I guess it helped a lot RPC sessions : No need to cancel a timer if
we're likely to rearm it in the following 10 usec ;)

$ git grep -n INET_CSK_CLEAR_TIMERS
include/net/inet_connection_sock.h:29:#undef INET_CSK_CLEAR_TIMERS
include/net/inet_connection_sock.h:199:#ifdef INET_CSK_CLEAR_TIMERS
include/net/inet_connection_sock.h:204:#ifdef INET_CSK_CLEAR_TIMERS



  parent reply	other threads:[~2015-05-27 14:53 UTC|newest]

Thread overview: 27+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2015-05-26 22:50 [patch 0/7] timers: Footprint diet and NOHZ overhead mitigation Thomas Gleixner
2015-05-26 22:50 ` [patch 1/7] timers: Sanitize catchup_timer_jiffies() usage Thomas Gleixner
2015-05-27  5:39   ` Viresh Kumar
2015-06-19 13:22   ` [tip:timers/core] " tip-bot for Thomas Gleixner
2015-05-26 22:50 ` [patch 2/7] timer: Remove FIFO guarantee Thomas Gleixner
2015-05-27  9:11   ` Viresh Kumar
2015-06-19 13:22   ` [tip:timers/core] timer: Remove FIFO "guarantee" tip-bot for Thomas Gleixner
2015-05-26 22:50 ` [patch 3/7] timer: Use hlist for the timer wheel hash buckets Thomas Gleixner
2015-05-27  9:13   ` Viresh Kumar
2015-06-19 13:23   ` [tip:timers/core] " tip-bot for Thomas Gleixner
2015-05-26 22:50 ` [patch 4/7] timer: Replace timer base by a cpu index Thomas Gleixner
2015-05-27  9:22   ` Viresh Kumar
2015-05-27 12:09     ` Peter Zijlstra
2015-06-02 13:58       ` Thomas Gleixner
2015-06-19 13:23   ` [tip:timers/core] " tip-bot for Thomas Gleixner
2015-06-27  9:55   ` [PATCH] timer: Fix unsafe cpu variable access in migrate_timers Jan Kiszka
2015-06-27 11:00     ` Borislav Petkov
2015-06-27 11:19       ` Jan Kiszka
2015-06-27 11:25         ` Borislav Petkov
2015-05-26 22:50 ` [patch 5/7] timer: stats: Simplify the flags handling Thomas Gleixner
2015-06-19 13:23   ` [tip:timers/core] timer: Stats: " tip-bot for Thomas Gleixner
2015-05-26 22:50 ` [patch 6/7] timer: Reduce timer migration overhead if disabled Thomas Gleixner
2015-06-19 13:23   ` [tip:timers/core] " tip-bot for Thomas Gleixner
2015-05-26 22:50 ` [patch 7/7] timer: Minimize nohz off overhead Thomas Gleixner
2015-06-19 13:24   ` [tip:timers/core] " tip-bot for Thomas Gleixner
2015-05-27 14:53 ` Eric Dumazet [this message]
2015-06-02 13:58   ` [patch 0/7] timers: Footprint diet and NOHZ overhead mitigation Thomas Gleixner

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=1432738411.4060.389.camel@edumazet-glaptop2.roam.corp.google.com \
    --to=eric.dumazet@gmail.com \
    --cc=edumazet@google.com \
    --cc=fweisbec@gmail.com \
    --cc=john.stultz@linaro.org \
    --cc=joonwoop@codeaurora.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mingo@kernel.org \
    --cc=paulmck@linux.vnet.ibm.com \
    --cc=peterz@infradead.org \
    --cc=tglx@linutronix.de \
    --cc=viresh.kumar@linaro.org \
    --cc=wenbo.wang@memblaze.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.