From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1754256AbbIILJ1 (ORCPT ); Wed, 9 Sep 2015 07:09:27 -0400
Received: from foss.arm.com ([217.140.101.70]:57971 "EHLO foss.arm.com"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
	id S1752780AbbIILJT (ORCPT ); Wed, 9 Sep 2015 07:09:19 -0400
Date: Wed, 9 Sep 2015 12:13:10 +0100
From: Morten Rasmussen
To: Peter Zijlstra
Cc: Vincent Guittot , Dietmar Eggemann , Steve Muckle ,
	"mingo@redhat.com" , "daniel.lezcano@linaro.org" ,
	"yuyang.du@intel.com" , "mturquette@baylibre.com" ,
	"rjw@rjwysocki.net" , Juri Lelli , "sgurrappadi@nvidia.com" ,
	"pang.xunlei@zte.com.cn" , "linux-kernel@vger.kernel.org"
Subject: Re: [PATCH 5/6] sched/fair: Get rid of scaling utilization by capacity_orig
Message-ID: <20150909111309.GD27098@e105550-lin.cambridge.arm.com>
References: <55E8DD00.2030706@linaro.org> <55EDAF43.30500@arm.com>
	<55EDDD5A.70904@arm.com> <20150908122606.GH3644@twins.programming.kicks-ass.net>
	<20150908125205.GW18673@twins.programming.kicks-ass.net>
	<20150908143157.GA27098@e105550-lin.cambridge.arm.com>
	<20150908165331.GC27098@e105550-lin.cambridge.arm.com>
	<20150909094305.GO3644@twins.programming.kicks-ass.net>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20150909094305.GO3644@twins.programming.kicks-ass.net>
User-Agent: Mutt/1.5.21 (2010-09-15)
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On Wed, Sep 09, 2015 at 11:43:05AM +0200, Peter Zijlstra wrote:
> On Tue, Sep 08, 2015 at 05:53:31PM +0100, Morten Rasmussen wrote:
> > On Tue, Sep 08, 2015 at 03:31:58PM +0100, Morten Rasmussen wrote:
> > > > On Tue, Sep 08, 2015 at 02:52:05PM +0200, Peter Zijlstra wrote:
> > > But if we apply the scaling to the weight instead of time, we would only
> > > have to apply it once and not three times like it is now?
> > > So maybe we
> > > can end up with almost the same number of multiplications.
> > >
> > > We might be losing bits for low priority tasks running on cpus at a low
> > > frequency though.
> >
> > Something like the below. We should be saving one multiplication.
> >
> > @@ -2577,8 +2575,13 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
> >  		return 0;
> >  	sa->last_update_time = now;
> >
> > -	scale_freq = arch_scale_freq_capacity(NULL, cpu);
> > -	scale_cpu = arch_scale_cpu_capacity(NULL, cpu);
> > +	if (weight || running)
> > +		scale_freq = arch_scale_freq_capacity(NULL, cpu);
> > +	if (weight)
> > +		scaled_weight = weight * scale_freq >> SCHED_CAPACITY_SHIFT;
> > +	if (running)
> > +		scale_freq_cpu = scale_freq * arch_scale_cpu_capacity(NULL, cpu)
> > +			>> SCHED_CAPACITY_SHIFT;
> >
> >  	/* delta_w is the amount already accumulated against our next period */
> >  	delta_w = sa->period_contrib;
> > @@ -2594,16 +2597,15 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
> >  		 * period and accrue it.
> >  		 */
> >  		delta_w = 1024 - delta_w;
> > -		scaled_delta_w = cap_scale(delta_w, scale_freq);
> >  		if (weight) {
> > -			sa->load_sum += weight * scaled_delta_w;
> > +			sa->load_sum += scaled_weight * delta_w;
> >  			if (cfs_rq) {
> >  				cfs_rq->runnable_load_sum +=
> > -					weight * scaled_delta_w;
> > +					scaled_weight * delta_w;
> >  			}
> >  		}
> >  		if (running)
> > -			sa->util_sum += scaled_delta_w * scale_cpu;
> > +			sa->util_sum += delta_w * scale_freq_cpu;
> >
> >  		delta -= delta_w;
>
> Sadly that makes the code worse; I get 14 mul instructions where
> previously I had 11.
>
> What happens is that GCC gets confused and cannot constant propagate the
> new variables, so what used to be shifts now end up being actual
> multiplications.
>
> With this, I get back to 11. Can you see what happens on ARM where you
> have both functions defined to non constants?

We repeated the experiment on arm and arm64, but still with the functions
defined to constants, to compare with your results.
The mul instruction count seems to be somewhat compiler version
dependent, but consistently shows no effect of the patch:

arm		before	after
gcc4.9		12	12
gcc4.8		10	10

arm64		before	after
gcc4.9		11	11

I will get numbers with the arch functions implemented as well and do
hackbench runs to see what happens in terms of performance.