From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1754256AbbIILJ1 (ORCPT ); Wed, 9 Sep 2015 07:09:27 -0400
Received: from foss.arm.com ([217.140.101.70]:57971 "EHLO foss.arm.com"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
	id S1752780AbbIILJT (ORCPT ); Wed, 9 Sep 2015 07:09:19 -0400
Date: Wed, 9 Sep 2015 12:13:10 +0100
From: Morten Rasmussen
To: Peter Zijlstra
Cc: Vincent Guittot , Dietmar Eggemann , Steve Muckle ,
	"mingo@redhat.com" , "daniel.lezcano@linaro.org" ,
	"yuyang.du@intel.com" , "mturquette@baylibre.com" ,
	"rjw@rjwysocki.net" , Juri Lelli , "sgurrappadi@nvidia.com" ,
	"pang.xunlei@zte.com.cn" , "linux-kernel@vger.kernel.org"
Subject: Re: [PATCH 5/6] sched/fair: Get rid of scaling utilization by capacity_orig
Message-ID: <20150909111309.GD27098@e105550-lin.cambridge.arm.com>
References: <55E8DD00.2030706@linaro.org> <55EDAF43.30500@arm.com>
	<55EDDD5A.70904@arm.com> <20150908122606.GH3644@twins.programming.kicks-ass.net>
	<20150908125205.GW18673@twins.programming.kicks-ass.net>
	<20150908143157.GA27098@e105550-lin.cambridge.arm.com>
	<20150908165331.GC27098@e105550-lin.cambridge.arm.com>
	<20150909094305.GO3644@twins.programming.kicks-ass.net>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20150909094305.GO3644@twins.programming.kicks-ass.net>
User-Agent: Mutt/1.5.21 (2010-09-15)
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On Wed, Sep 09, 2015 at 11:43:05AM +0200, Peter Zijlstra wrote:
> On Tue, Sep 08, 2015 at 05:53:31PM +0100, Morten Rasmussen wrote:
> > On Tue, Sep 08, 2015 at 03:31:58PM +0100, Morten Rasmussen wrote:
> > > > On Tue, Sep 08, 2015 at 02:52:05PM +0200, Peter Zijlstra wrote:
> > > But if we apply the scaling to the weight instead of time, we would only
> > > have to apply it once and not three times like it is now?
> > > So maybe we
> > > can end up with almost the same number of multiplications.
> > >
> > > We might be losing bits for low priority tasks running on cpus at a low
> > > frequency though.
> >
> > Something like the below. We should be saving one multiplication.
> >
> > @@ -2577,8 +2575,13 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
> >  		return 0;
> >  	sa->last_update_time = now;
> >
> > -	scale_freq = arch_scale_freq_capacity(NULL, cpu);
> > -	scale_cpu = arch_scale_cpu_capacity(NULL, cpu);
> > +	if (weight || running)
> > +		scale_freq = arch_scale_freq_capacity(NULL, cpu);
> > +	if (weight)
> > +		scaled_weight = weight * scale_freq >> SCHED_CAPACITY_SHIFT;
> > +	if (running)
> > +		scale_freq_cpu = scale_freq * arch_scale_cpu_capacity(NULL, cpu)
> > +			>> SCHED_CAPACITY_SHIFT;
> >
> >  	/* delta_w is the amount already accumulated against our next period */
> >  	delta_w = sa->period_contrib;
> > @@ -2594,16 +2597,15 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
> >  		 * period and accrue it.
> >  		 */
> >  		delta_w = 1024 - delta_w;
> > -		scaled_delta_w = cap_scale(delta_w, scale_freq);
> >  		if (weight) {
> > -			sa->load_sum += weight * scaled_delta_w;
> > +			sa->load_sum += scaled_weight * delta_w;
> >  			if (cfs_rq) {
> >  				cfs_rq->runnable_load_sum +=
> > -					weight * scaled_delta_w;
> > +					scaled_weight * delta_w;
> >  			}
> >  		}
> >  		if (running)
> > -			sa->util_sum += scaled_delta_w * scale_cpu;
> > +			sa->util_sum += delta_w * scale_freq_cpu;
> >
> >  		delta -= delta_w;
>
> Sadly that makes the code worse; I get 14 mul instructions where
> previously I had 11.
>
> What happens is that GCC gets confused and cannot constant propagate the
> new variables, so what used to be shifts now end up being actual
> multiplications.
>
> With this, I get back to 11. Can you see what happens on ARM where you
> have both functions defined to non constants?

We repeated the experiment on arm and arm64, but still with the functions
defined to constants, to compare with your results.
The mul instruction count seems to be somewhat compiler version
dependent, but consistently shows no effect of the patch:

arm		before	after
gcc4.9		12	12
gcc4.8		10	10

arm64		before	after
gcc4.9		11	11

I will get numbers with the arch functions implemented as well and do
hackbench runs to see what happens in terms of performance.