* [PATCH 0/3] cpufreq: Replace timers with utilization update callbacks
@ 2016-01-29 22:52 Rafael J. Wysocki
  2016-01-29 22:53 ` [PATCH 1/3] cpufreq: Add a mechanism for registering " Rafael J. Wysocki
                   ` (4 more replies)
  0 siblings, 5 replies; 134+ messages in thread
From: Rafael J. Wysocki @ 2016-01-29 22:52 UTC (permalink / raw)
  To: Linux PM list
  Cc: Linux Kernel Mailing List, Peter Zijlstra, Srinivas Pandruvada,
	Viresh Kumar, Juri Lelli, Steve Muckle, Thomas Gleixner

Hi,

The following patch series introduces a mechanism allowing the cpufreq core
and "setpolicy" drivers to provide utilization update callbacks to be invoked
by the scheduler on utilization changes.  Those callbacks can be used to run
the sampling and frequency adjustments code (intel_pstate) or to schedule the
execution of that code in process context (cpufreq core) instead of per-CPU
deferrable timers used in cpufreq today (which Thomas complained about during
the last Kernel Summit).
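
In a nutshell, the hook boils down to the following (condensed from patch
[1/3] below; kerneldoc and locking omitted):

  struct update_util_data {
          void (*func)(struct update_util_data *data,
                       u64 time, unsigned long util, unsigned long max);
  };

  /* A consumer (driver or governor) publishes a per-CPU callback pointer: */
  void cpufreq_set_update_util_data(int cpu, struct update_util_data *data);

  /* ... which the scheduler then invokes from its load-tracking code path: */
  void cpufreq_update_util(u64 time, unsigned long util, unsigned long max);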

[1/3] Introduce a mechanism for calling into cpufreq from the scheduler and
      registering callbacks to be executed from there.

[2/3] Modify intel_pstate to use the mechanism introduced by [1/3] instead
      of per-CPU deferrable timers to do its work.

This isn't entirely straightforward as the scheduler context running those
callbacks is really special.  Among other things it can only use raw
spinlocks and cannot invoke wake_up_process() directly.  Also, calling
ktime_get() from there may be too expensive on some systems.  All that has to
be taken into account, but even then the change allows some lines of code to be
cut from the driver.
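
For illustration only, a callback obeying those constraints might look like
this (the "example_*" names are made up; the real implementations are in
patches [2/3] and [3/3]):

  #include <linux/cpufreq.h>
  #include <linux/irq_work.h>
  #include <linux/spinlock.h>

  struct example_cpu {
          raw_spinlock_t lock;            /* raw spinlocks only here */
          unsigned long last_util;
          struct irq_work irq_work;
          struct update_util_data update_util;
  };

  static void example_update_util(struct update_util_data *data, u64 time,
                                  unsigned long util, unsigned long max)
  {
          struct example_cpu *ec = container_of(data, struct example_cpu,
                                                update_util);

          raw_spin_lock(&ec->lock);
          ec->last_util = util;
          raw_spin_unlock(&ec->lock);

          /*
           * wake_up_process() must not be called from this context, so
           * anything needing process context is bounced through irq_work.
           */
          irq_work_queue(&ec->irq_work);
  }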

Some performance and energy consumption measurements have been carried out with
an earlier version of this patch and it looks like the changes lead to a
slightly better performing system that consumes slightly less energy at the
same time overall.

[3/3] Modify the cpufreq core to use the mechanism introduced by [1/3] instead
      of per-CPU deferrable timers to queue up the execution of governor work.

Again, this isn't really straightforward for the above reasons, but still the
code size is reduced a bit by the changes.

I'm still unsure about the energy consumption and performance impact of [3/3]
as earlier versions of it led to inconsistent results (most likely due to bugs
in them that hopefully have been fixed in this version).  In particular, the
additional irq_work may turn out to be problematic, but more optimizations are
possible on top of this one even if it makes things worse by itself.

For example, it should be possible to move the execution of state selection
code into the utilization update callback itself, at least in principle, for
all governors.  The P-state/OPP adjustment may need to be run from process
context still, but for the drivers that can do it without sleeping it should
be possible to move that into the utilization update callback as well.
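
Purely as a sketch of that direction (nothing like this is in the series yet,
and all of the names below are hypothetical):

  extern unsigned int example_pick_freq(unsigned long util, unsigned long max);
  extern void example_write_freq(unsigned int freq);      /* must not sleep */

  static void example_fast_update_util(struct update_util_data *data, u64 time,
                                       unsigned long util, unsigned long max)
  {
          /* Governor state selection done right here, in scheduler context. */
          unsigned int next_freq = example_pick_freq(util, max);

          example_write_freq(next_freq);
  }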

The patches are on top of 4.5-rc1 and have been tested on a couple of x86
machines.

Thanks,
Rafael


* [PATCH 1/3] cpufreq: Add a mechanism for registering utilization update callbacks
  2016-01-29 22:52 [PATCH 0/3] cpufreq: Replace timers with utilization update callbacks Rafael J. Wysocki
@ 2016-01-29 22:53 ` Rafael J. Wysocki
  2016-02-04  3:31   ` Viresh Kumar
  2016-01-29 22:56 ` [PATCH 2/3] cpufreq: intel_pstate: Replace timers with " Rafael J. Wysocki
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 134+ messages in thread
From: Rafael J. Wysocki @ 2016-01-29 22:53 UTC (permalink / raw)
  To: Linux PM list
  Cc: Linux Kernel Mailing List, Peter Zijlstra, Srinivas Pandruvada,
	Viresh Kumar, Juri Lelli, Steve Muckle, Thomas Gleixner

From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

Introduce a mechanism by which parts of the cpufreq subsystem
("setpolicy" drivers or the core) can register callbacks to be
executed from cpufreq_update_util() which is invoked by the
scheduler's update_load_avg() on CPU utilization changes.

This allows the "setpolicy" drivers to dispense with their timers
and do all of the computations they need and frequency/voltage
adjustments in the update_load_avg() code path, among other things.

The scheduler changes were suggested by Peter Zijlstra.
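
Driver-side usage then amounts to publishing and clearing a per-CPU pointer
(a sketch with hypothetical "my_*" names; see the intel_pstate patch for the
real thing):

  static void my_update_util(struct update_util_data *data, u64 time,
                             unsigned long util, unsigned long max)
  {
          /* Sampling and frequency adjustment go here; must not sleep. */
  }

  static struct update_util_data my_data = { .func = my_update_util };

  static void my_start(int cpu)
  {
          cpufreq_set_update_util_data(cpu, &my_data);
  }

  static void my_stop(int cpu)
  {
          cpufreq_set_update_util_data(cpu, NULL);
          synchronize_rcu();      /* wait for callbacks still in flight */
  }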

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
---
 drivers/cpufreq/cpufreq.c |   44 ++++++++++++++++++++++++++++++++++++++++++++
 include/linux/cpufreq.h   |    7 +++++++
 include/linux/sched.h     |    2 ++
 kernel/sched/fair.c       |   29 ++++++++++++++++++++++++++++-
 4 files changed, 81 insertions(+), 1 deletion(-)

Index: linux-pm/include/linux/sched.h
===================================================================
--- linux-pm.orig/include/linux/sched.h
+++ linux-pm/include/linux/sched.h
@@ -3207,4 +3207,6 @@ static inline unsigned long rlimit_max(u
 	return task_rlimit_max(current, limit);
 }
 
+void cpufreq_update_util(u64 time, unsigned long util, unsigned long max);
+
 #endif
Index: linux-pm/kernel/sched/fair.c
===================================================================
--- linux-pm.orig/kernel/sched/fair.c
+++ linux-pm/kernel/sched/fair.c
@@ -2803,12 +2803,17 @@ static inline int update_cfs_rq_load_avg
 	return decayed || removed;
 }
 
+__weak void cpufreq_update_util(u64 time, unsigned long util, unsigned long max)
+{
+}
+
 /* Update task and its cfs_rq load average */
 static inline void update_load_avg(struct sched_entity *se, int update_tg)
 {
 	struct cfs_rq *cfs_rq = cfs_rq_of(se);
 	u64 now = cfs_rq_clock_task(cfs_rq);
-	int cpu = cpu_of(rq_of(cfs_rq));
+	struct rq *rq = rq_of(cfs_rq);
+	int cpu = cpu_of(rq);
 
 	/*
 	 * Track task load average for carrying it to new CPU after migrated, and
@@ -2820,6 +2825,28 @@ static inline void update_load_avg(struc
 
 	if (update_cfs_rq_load_avg(now, cfs_rq) && update_tg)
 		update_tg_load_avg(cfs_rq, 0);
+
+	if (cpu == smp_processor_id() && &rq->cfs == cfs_rq) {
+		unsigned long max = rq->cpu_capacity_orig;
+
+		/*
+		 * There are a few boundary cases this might miss, but it
+		 * should get called often enough that this is (hopefully) not
+		 * a real problem.  Also, it is only called on the local CPU,
+		 * so if we enqueue remotely we'll lose an update, but the
+		 * next tick/schedule should pick it up.
+		 *
+		 * It will not get called when we go idle, because the idle
+		 * thread is a different class (!fair), nor will the utilization
+		 * number include things like RT tasks.
+		 *
+		 * As is, the util number is not freq invariant (we'd have to
+		 * implement arch_scale_freq_capacity() for that).
+		 *
+		 * See cpu_util().
+		 */
+		cpufreq_update_util(now, min(cfs_rq->avg.util_avg, max), max);
+	}
 }
 
 static void attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
Index: linux-pm/drivers/cpufreq/cpufreq.c
===================================================================
--- linux-pm.orig/drivers/cpufreq/cpufreq.c
+++ linux-pm/drivers/cpufreq/cpufreq.c
@@ -102,6 +102,50 @@ static LIST_HEAD(cpufreq_governor_list);
 static struct cpufreq_driver *cpufreq_driver;
 static DEFINE_PER_CPU(struct cpufreq_policy *, cpufreq_cpu_data);
 static DEFINE_RWLOCK(cpufreq_driver_lock);
+
+static DEFINE_PER_CPU(struct update_util_data *, cpufreq_update_util_data);
+
+/**
+ * cpufreq_set_update_util_data - Populate the CPU's update_util_data pointer.
+ * @cpu: The CPU to set the pointer for.
+ * @data: New pointer value.
+ *
+ * Set and publish the update_util_data pointer for the given CPU.  That pointer
+ * points to a struct update_util_data object containing a callback function
+ * to call from cpufreq_update_util().  That function will be called from an RCU
+ * read-side critical section, so it must not sleep.
+ *
+ * Callers must use RCU callbacks to free any memory that might be accessed
+ * via the old update_util_data pointer or invoke synchronize_rcu() right after
+ * this function to avoid use-after-free.
+ */
+void cpufreq_set_update_util_data(int cpu, struct update_util_data *data)
+{
+	rcu_assign_pointer(per_cpu(cpufreq_update_util_data, cpu), data);
+}
+EXPORT_SYMBOL_GPL(cpufreq_set_update_util_data);
+
+/**
+ * cpufreq_update_util - Take a note about CPU utilization changes.
+ * @time: Current time.
+ * @util: Current utilization.
+ * @max: Utilization ceiling.
+ *
+ * This function is called by the scheduler on every invocation of
+ * update_load_avg() on the CPU whose utilization is being updated.
+ */
+void cpufreq_update_util(u64 time, unsigned long util, unsigned long max)
+{
+	struct update_util_data *data;
+
+	rcu_read_lock();
+
+	data = rcu_dereference(*this_cpu_ptr(&cpufreq_update_util_data));
+	if (data && data->func)
+		data->func(data, time, util, max);
+
+	rcu_read_unlock();
+}
+
 DEFINE_MUTEX(cpufreq_governor_lock);
 
 /* Flag to suspend/resume CPUFreq governors */
Index: linux-pm/include/linux/cpufreq.h
===================================================================
--- linux-pm.orig/include/linux/cpufreq.h
+++ linux-pm/include/linux/cpufreq.h
@@ -322,6 +322,13 @@ int cpufreq_unregister_driver(struct cpu
 const char *cpufreq_get_current_driver(void);
 void *cpufreq_get_driver_data(void);
 
+struct update_util_data {
+	void (*func)(struct update_util_data *data,
+		     u64 time, unsigned long util, unsigned long max);
+};
+
+void cpufreq_set_update_util_data(int cpu, struct update_util_data *data);
+
 static inline void cpufreq_verify_within_limits(struct cpufreq_policy *policy,
 		unsigned int min, unsigned int max)
 {


* [PATCH 2/3] cpufreq: intel_pstate: Replace timers with utilization update callbacks
  2016-01-29 22:52 [PATCH 0/3] cpufreq: Replace timers with utilization update callbacks Rafael J. Wysocki
  2016-01-29 22:53 ` [PATCH 1/3] cpufreq: Add a mechanism for registering " Rafael J. Wysocki
@ 2016-01-29 22:56 ` Rafael J. Wysocki
  2016-01-29 22:59 ` [PATCH 3/3] cpufreq: governor: " Rafael J. Wysocki
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 134+ messages in thread
From: Rafael J. Wysocki @ 2016-01-29 22:56 UTC (permalink / raw)
  To: Linux PM list
  Cc: Linux Kernel Mailing List, Peter Zijlstra, Srinivas Pandruvada,
	Viresh Kumar, Juri Lelli, Steve Muckle, Thomas Gleixner

From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

Instead of using a per-CPU deferrable timer for utilization sampling
and P-states adjustments, register a utilization update callback that
will be invoked from the scheduler on utilization changes.

The sampling rate is still the same as what was used for the deferrable
timers, so the functional impact of this patch should be negligible.
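
The callback fires much more often than the old timer did, so it rate-limits
itself; the heart of the change is a check of this shape (lifted from the
patch below):

  u64 delta_ns = time - cpu->sample.time;

  if ((s64)delta_ns >= pid_params.sample_rate_ns) {
          intel_pstate_sample(cpu, time);
          if (!hwp_active)
                  intel_pstate_adjust_busy_pstate(cpu);
  }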

Based on an earlier patch from Srinivas Pandruvada.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
---
 drivers/cpufreq/intel_pstate.c |  103 +++++++++++++++--------------------------
 1 file changed, 39 insertions(+), 64 deletions(-)

Index: linux-pm/drivers/cpufreq/intel_pstate.c
===================================================================
--- linux-pm.orig/drivers/cpufreq/intel_pstate.c
+++ linux-pm/drivers/cpufreq/intel_pstate.c
@@ -71,7 +71,7 @@ struct sample {
 	u64 mperf;
 	u64 tsc;
 	int freq;
-	ktime_t time;
+	u64 time;
 };
 
 struct pstate_data {
@@ -103,13 +103,13 @@ struct _pid {
 struct cpudata {
 	int cpu;
 
-	struct timer_list timer;
+	struct update_util_data update_util;
 
 	struct pstate_data pstate;
 	struct vid_data vid;
 	struct _pid pid;
 
-	ktime_t last_sample_time;
+	u64	last_sample_time;
 	u64	prev_aperf;
 	u64	prev_mperf;
 	u64	prev_tsc;
@@ -120,6 +120,7 @@ struct cpudata {
 static struct cpudata **all_cpu_data;
 struct pstate_adjust_policy {
 	int sample_rate_ms;
+	s64 sample_rate_ns;
 	int deadband;
 	int setpoint;
 	int p_gain_pct;
@@ -712,7 +713,7 @@ static void core_set_pstate(struct cpuda
 	if (limits->no_turbo && !limits->turbo_disabled)
 		val |= (u64)1 << 32;
 
-	wrmsrl_on_cpu(cpudata->cpu, MSR_IA32_PERF_CTL, val);
+	wrmsrl(MSR_IA32_PERF_CTL, val);
 }
 
 static int knl_get_turbo_pstate(void)
@@ -883,7 +884,7 @@ static inline void intel_pstate_calc_bus
 	sample->core_pct_busy = (int32_t)core_pct;
 }
 
-static inline void intel_pstate_sample(struct cpudata *cpu)
+static inline void intel_pstate_sample(struct cpudata *cpu, u64 time)
 {
 	u64 aperf, mperf;
 	unsigned long flags;
@@ -900,7 +901,7 @@ static inline void intel_pstate_sample(s
 	local_irq_restore(flags);
 
 	cpu->last_sample_time = cpu->sample.time;
-	cpu->sample.time = ktime_get();
+	cpu->sample.time = time;
 	cpu->sample.aperf = aperf;
 	cpu->sample.mperf = mperf;
 	cpu->sample.tsc =  tsc;
@@ -915,22 +916,6 @@ static inline void intel_pstate_sample(s
 	cpu->prev_tsc = tsc;
 }
 
-static inline void intel_hwp_set_sample_time(struct cpudata *cpu)
-{
-	int delay;
-
-	delay = msecs_to_jiffies(50);
-	mod_timer_pinned(&cpu->timer, jiffies + delay);
-}
-
-static inline void intel_pstate_set_sample_time(struct cpudata *cpu)
-{
-	int delay;
-
-	delay = msecs_to_jiffies(pid_params.sample_rate_ms);
-	mod_timer_pinned(&cpu->timer, jiffies + delay);
-}
-
 static inline int32_t get_target_pstate_use_cpu_load(struct cpudata *cpu)
 {
 	struct sample *sample = &cpu->sample;
@@ -970,8 +955,7 @@ static inline int32_t get_target_pstate_
 static inline int32_t get_target_pstate_use_performance(struct cpudata *cpu)
 {
 	int32_t core_busy, max_pstate, current_pstate, sample_ratio;
-	s64 duration_us;
-	u32 sample_time;
+	u64 duration_ns;
 
 	/*
 	 * core_busy is the ratio of actual performance to max
@@ -990,18 +974,16 @@ static inline int32_t get_target_pstate_
 	core_busy = mul_fp(core_busy, div_fp(max_pstate, current_pstate));
 
 	/*
-	 * Since we have a deferred timer, it will not fire unless
-	 * we are in C0.  So, determine if the actual elapsed time
-	 * is significantly greater (3x) than our sample interval.  If it
-	 * is, then we were idle for a long enough period of time
-	 * to adjust our busyness.
+	 * Since our utilization update callback will not run unless we are
+	 * in C0, check if the actual elapsed time is significantly greater (3x)
+	 * than our sample interval.  If it is, then we were idle for a long
+	 * enough period of time to adjust our busyness.
 	 */
-	sample_time = pid_params.sample_rate_ms  * USEC_PER_MSEC;
-	duration_us = ktime_us_delta(cpu->sample.time,
-				     cpu->last_sample_time);
-	if (duration_us > sample_time * 3) {
-		sample_ratio = div_fp(int_tofp(sample_time),
-				      int_tofp(duration_us));
+	duration_ns = cpu->sample.time - cpu->last_sample_time;
+	if ((s64)duration_ns > pid_params.sample_rate_ns * 3
+	    && cpu->last_sample_time > 0) {
+		sample_ratio = div_fp(int_tofp(pid_params.sample_rate_ns),
+				      int_tofp(duration_ns));
 		core_busy = mul_fp(core_busy, sample_ratio);
 	}
 
@@ -1031,23 +1013,17 @@ static inline void intel_pstate_adjust_b
 		sample->freq);
 }
 
-static void intel_hwp_timer_func(unsigned long __data)
-{
-	struct cpudata *cpu = (struct cpudata *) __data;
-
-	intel_pstate_sample(cpu);
-	intel_hwp_set_sample_time(cpu);
-}
-
-static void intel_pstate_timer_func(unsigned long __data)
+static void intel_pstate_update_util(struct update_util_data *data, u64 time,
+				     unsigned long util, unsigned long max)
 {
-	struct cpudata *cpu = (struct cpudata *) __data;
-
-	intel_pstate_sample(cpu);
+	struct cpudata *cpu = container_of(data, struct cpudata, update_util);
+	u64 delta_ns = time - cpu->sample.time;
 
-	intel_pstate_adjust_busy_pstate(cpu);
-
-	intel_pstate_set_sample_time(cpu);
+	if ((s64)delta_ns >= pid_params.sample_rate_ns) {
+		intel_pstate_sample(cpu, time);
+		if (!hwp_active)
+			intel_pstate_adjust_busy_pstate(cpu);
+	}
 }
 
 #define ICPU(model, policy) \
@@ -1095,24 +1071,19 @@ static int intel_pstate_init_cpu(unsigne
 
 	cpu->cpu = cpunum;
 
-	if (hwp_active)
+	if (hwp_active) {
 		intel_pstate_hwp_enable(cpu);
+		pid_params.sample_rate_ms = 50;
+		pid_params.sample_rate_ns = 50 * NSEC_PER_MSEC;
+	}
 
 	intel_pstate_get_cpu_pstates(cpu);
 
-	init_timer_deferrable(&cpu->timer);
-	cpu->timer.data = (unsigned long)cpu;
-	cpu->timer.expires = jiffies + HZ/100;
-
-	if (!hwp_active)
-		cpu->timer.function = intel_pstate_timer_func;
-	else
-		cpu->timer.function = intel_hwp_timer_func;
-
 	intel_pstate_busy_pid_reset(cpu);
-	intel_pstate_sample(cpu);
+	intel_pstate_sample(cpu, 0);
 
-	add_timer_on(&cpu->timer, cpunum);
+	cpu->update_util.func = intel_pstate_update_util;
+	cpufreq_set_update_util_data(cpunum, &cpu->update_util);
 
 	pr_debug("intel_pstate: controlling: cpu %d\n", cpunum);
 
@@ -1196,7 +1167,9 @@ static void intel_pstate_stop_cpu(struct
 
 	pr_debug("intel_pstate: CPU %d exiting\n", cpu_num);
 
-	del_timer_sync(&all_cpu_data[cpu_num]->timer);
+	cpufreq_set_update_util_data(cpu_num, NULL);
+	synchronize_rcu();
+
 	if (hwp_active)
 		return;
 
@@ -1260,6 +1233,7 @@ static int intel_pstate_msrs_not_valid(v
 static void copy_pid_params(struct pstate_adjust_policy *policy)
 {
 	pid_params.sample_rate_ms = policy->sample_rate_ms;
+	pid_params.sample_rate_ns = pid_params.sample_rate_ms * NSEC_PER_MSEC;
 	pid_params.p_gain_pct = policy->p_gain_pct;
 	pid_params.i_gain_pct = policy->i_gain_pct;
 	pid_params.d_gain_pct = policy->d_gain_pct;
@@ -1451,7 +1425,8 @@ out:
 	get_online_cpus();
 	for_each_online_cpu(cpu) {
 		if (all_cpu_data[cpu]) {
-			del_timer_sync(&all_cpu_data[cpu]->timer);
+			cpufreq_set_update_util_data(cpu, NULL);
+			synchronize_rcu();
 			kfree(all_cpu_data[cpu]);
 		}
 	}


* [PATCH 3/3] cpufreq: governor: Replace timers with utilization update callbacks
  2016-01-29 22:52 [PATCH 0/3] cpufreq: Replace timers with utilization update callbacks Rafael J. Wysocki
  2016-01-29 22:53 ` [PATCH 1/3] cpufreq: Add a mechanism for registering " Rafael J. Wysocki
  2016-01-29 22:56 ` [PATCH 2/3] cpufreq: intel_pstate: Replace timers with " Rafael J. Wysocki
@ 2016-01-29 22:59 ` Rafael J. Wysocki
  2016-02-03  1:16   ` [Update][PATCH " Rafael J. Wysocki
  2016-02-03 22:20 ` [PATCH 0/3] cpufreq: " Rafael J. Wysocki
  2016-02-10 15:17 ` [PATCH v6 " Rafael J. Wysocki
  4 siblings, 1 reply; 134+ messages in thread
From: Rafael J. Wysocki @ 2016-01-29 22:59 UTC (permalink / raw)
  To: Linux PM list
  Cc: Linux Kernel Mailing List, Peter Zijlstra, Srinivas Pandruvada,
	Viresh Kumar, Juri Lelli, Steve Muckle, Thomas Gleixner

From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

Instead of using a per-CPU deferrable timer for queuing up governor
work items, register a utilization update callback that will be
invoked from the scheduler on utilization changes.

The sampling rate is still the same as what was used for the
deferrable timers and the added irq_work overhead should be offset by
the eliminated timers overhead, so in theory the functional impact of
this patch should not be significant.
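
The resulting wake-up chain, condensed from the patch below, is: scheduler
callback -> (rate-limited) irq_work -> work item.  The irq_work handler
exists only to cross from scheduler context into one where work may be
queued:

  static void dbs_irq_work(struct irq_work *irq_work)
  {
          struct cpu_common_dbs_info *shared =
                  container_of(irq_work, struct cpu_common_dbs_info, irq_work);

          /* Safe here, unlike in the scheduler path that raised this. */
          schedule_work(&shared->work);
  }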

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
---
 drivers/cpufreq/cpufreq_conservative.c |    6 -
 drivers/cpufreq/cpufreq_governor.c     |  129 +++++++++++++++------------------
 drivers/cpufreq/cpufreq_governor.h     |   13 ++-
 drivers/cpufreq/cpufreq_ondemand.c     |   25 +++---
 4 files changed, 81 insertions(+), 92 deletions(-)

Index: linux-pm/drivers/cpufreq/cpufreq_governor.h
===================================================================
--- linux-pm.orig/drivers/cpufreq/cpufreq_governor.h
+++ linux-pm/drivers/cpufreq/cpufreq_governor.h
@@ -18,6 +18,7 @@
 #define _CPUFREQ_GOVERNOR_H
 
 #include <linux/atomic.h>
+#include <linux/irq_work.h>
 #include <linux/cpufreq.h>
 #include <linux/kernel_stat.h>
 #include <linux/module.h>
@@ -139,7 +140,9 @@ struct cpu_common_dbs_info {
 	struct mutex timer_mutex;
 
 	ktime_t time_stamp;
+	s64 sample_delay_ns;
 	atomic_t skip_work;
+	struct irq_work irq_work;
 	struct work_struct work;
 };
 
@@ -155,7 +158,8 @@ struct cpu_dbs_info {
 	 * wake-up from idle.
 	 */
 	unsigned int prev_load;
-	struct timer_list timer;
+	u64 last_sample_time;
+	struct update_util_data update_util;
 	struct cpu_common_dbs_info *shared;
 };
 
@@ -212,8 +216,7 @@ struct common_dbs_data {
 
 	struct cpu_dbs_info *(*get_cpu_cdbs)(int cpu);
 	void *(*get_cpu_dbs_info_s)(int cpu);
-	unsigned int (*gov_dbs_timer)(struct cpufreq_policy *policy,
-				      bool modify_all);
+	unsigned int (*gov_dbs_timer)(struct cpufreq_policy *policy);
 	void (*gov_check_cpu)(int cpu, unsigned int load);
 	int (*init)(struct dbs_data *dbs_data, bool notify);
 	void (*exit)(struct dbs_data *dbs_data, bool notify);
@@ -270,8 +273,8 @@ static ssize_t show_sampling_rate_min_go
 }
 
 extern struct mutex cpufreq_governor_lock;
-
-void gov_add_timers(struct cpufreq_policy *policy, unsigned int delay);
+void gov_set_update_util(struct cpu_common_dbs_info *shared,
+			 unsigned int delay_us);
 void gov_cancel_work(struct cpu_common_dbs_info *shared);
 void dbs_check_cpu(struct dbs_data *dbs_data, int cpu);
 int cpufreq_governor_dbs(struct cpufreq_policy *policy,
Index: linux-pm/drivers/cpufreq/cpufreq_governor.c
===================================================================
--- linux-pm.orig/drivers/cpufreq/cpufreq_governor.c
+++ linux-pm/drivers/cpufreq/cpufreq_governor.c
@@ -128,10 +128,10 @@ void dbs_check_cpu(struct dbs_data *dbs_
 		 * dropped down. So we perform the copy only once, upon the
 		 * first wake-up from idle.)
 		 *
-		 * Detecting this situation is easy: the governor's deferrable
-		 * timer would not have fired during CPU-idle periods. Hence
-		 * an unusually large 'wall_time' (as compared to the sampling
-		 * rate) indicates this scenario.
+		 * Detecting this situation is easy: the governor's utilization
+		 * update handler would not have run during CPU-idle periods.
+		 * Hence, an unusually large 'wall_time' (as compared to the
+		 * sampling rate) indicates this scenario.
 		 *
 		 * prev_load can be zero in two cases and we must recalculate it
 		 * for both cases:
@@ -161,21 +161,26 @@ void dbs_check_cpu(struct dbs_data *dbs_
 }
 EXPORT_SYMBOL_GPL(dbs_check_cpu);
 
-void gov_add_timers(struct cpufreq_policy *policy, unsigned int delay)
+void gov_set_update_util(struct cpu_common_dbs_info *shared,
+			 unsigned int delay_us)
 {
+	struct cpufreq_policy *policy = shared->policy;
 	struct dbs_data *dbs_data = policy->governor_data;
-	struct cpu_dbs_info *cdbs;
 	int cpu;
 
+	shared->sample_delay_ns = delay_us * NSEC_PER_USEC;
+	shared->time_stamp = ktime_get();
+
 	for_each_cpu(cpu, policy->cpus) {
-		cdbs = dbs_data->cdata->get_cpu_cdbs(cpu);
-		cdbs->timer.expires = jiffies + delay;
-		add_timer_on(&cdbs->timer, cpu);
+		struct cpu_dbs_info *cdbs = dbs_data->cdata->get_cpu_cdbs(cpu);
+
+		cdbs->last_sample_time = 0;
+		cpufreq_set_update_util_data(cpu, &cdbs->update_util);
 	}
 }
-EXPORT_SYMBOL_GPL(gov_add_timers);
+EXPORT_SYMBOL_GPL(gov_set_update_util);
 
-static inline void gov_cancel_timers(struct cpufreq_policy *policy)
+static inline void gov_clear_update_util_data(struct cpufreq_policy *policy)
 {
 	struct dbs_data *dbs_data = policy->governor_data;
 	struct cpu_dbs_info *cdbs;
@@ -183,51 +188,26 @@ static inline void gov_cancel_timers(str
 
 	for_each_cpu(i, policy->cpus) {
 		cdbs = dbs_data->cdata->get_cpu_cdbs(i);
-		del_timer_sync(&cdbs->timer);
+		cpufreq_set_update_util_data(i, NULL);
 	}
+	synchronize_rcu();
 }
 
 void gov_cancel_work(struct cpu_common_dbs_info *shared)
 {
-	/* Tell dbs_timer_handler() to skip queuing up work items. */
+	/* Tell dbs_update_util_handler() to skip queuing up work items. */
 	atomic_inc(&shared->skip_work);
 	/*
-	 * If dbs_timer_handler() is already running, it may not notice the
-	 * incremented skip_work, so wait for it to complete to prevent its work
-	 * item from being queued up after the cancel_work_sync() below.
-	 */
-	gov_cancel_timers(shared->policy);
-	/*
-	 * In case dbs_timer_handler() managed to run and spawn a work item
-	 * before the timers have been canceled, wait for that work item to
-	 * complete and then cancel all of the timers set up by it.  If
-	 * dbs_timer_handler() runs again at that point, it will see the
-	 * positive value of skip_work and won't spawn any more work items.
+	 * If dbs_update_util_handler() is already running, it may not notice
+	 * the incremented skip_work, so wait for it to complete to prevent its
+	 * work item from being queued up after the cancel_work_sync() below.
 	 */
+	gov_clear_update_util_data(shared->policy);
 	cancel_work_sync(&shared->work);
-	gov_cancel_timers(shared->policy);
 	atomic_set(&shared->skip_work, 0);
 }
 EXPORT_SYMBOL_GPL(gov_cancel_work);
 
-/* Will return if we need to evaluate cpu load again or not */
-static bool need_load_eval(struct cpu_common_dbs_info *shared,
-			   unsigned int sampling_rate)
-{
-	if (policy_is_shared(shared->policy)) {
-		ktime_t time_now = ktime_get();
-		s64 delta_us = ktime_us_delta(time_now, shared->time_stamp);
-
-		/* Do nothing if we recently have sampled */
-		if (delta_us < (s64)(sampling_rate / 2))
-			return false;
-		else
-			shared->time_stamp = time_now;
-	}
-
-	return true;
-}
-
 static void dbs_work_handler(struct work_struct *work)
 {
 	struct cpu_common_dbs_info *shared = container_of(work, struct
@@ -235,14 +215,10 @@ static void dbs_work_handler(struct work
 	struct cpufreq_policy *policy;
 	struct dbs_data *dbs_data;
 	unsigned int sampling_rate, delay;
-	bool eval_load;
 
 	policy = shared->policy;
 	dbs_data = policy->governor_data;
 
-	/* Kill all timers */
-	gov_cancel_timers(policy);
-
 	if (dbs_data->cdata->governor == GOV_CONSERVATIVE) {
 		struct cs_dbs_tuners *cs_tuners = dbs_data->tuners;
 
@@ -253,37 +229,53 @@ static void dbs_work_handler(struct work
 		sampling_rate = od_tuners->sampling_rate;
 	}
 
-	eval_load = need_load_eval(shared, sampling_rate);
-
 	/*
-	 * Make sure cpufreq_governor_limits() isn't evaluating load in
+	 * Make sure cpufreq_governor_limits() isn't evaluating load or the
+	 * ondemand governor isn't reading the time stamp and sampling rate in
 	 * parallel.
 	 */
 	mutex_lock(&shared->timer_mutex);
-	delay = dbs_data->cdata->gov_dbs_timer(policy, eval_load);
+	delay = dbs_data->cdata->gov_dbs_timer(policy);
+	shared->sample_delay_ns = jiffies_to_nsecs(delay);
+	shared->time_stamp = ktime_get();
 	mutex_unlock(&shared->timer_mutex);
 
+	smp_mb__before_atomic();
 	atomic_dec(&shared->skip_work);
+}
+
+static void dbs_irq_work(struct irq_work *irq_work)
+{
+	struct cpu_common_dbs_info *shared;
 
-	gov_add_timers(policy, delay);
+	shared = container_of(irq_work, struct cpu_common_dbs_info, irq_work);
+	schedule_work(&shared->work);
 }
 
-static void dbs_timer_handler(unsigned long data)
+static void dbs_update_util_handler(struct update_util_data *data, u64 time,
+				    unsigned long util, unsigned long max)
 {
-	struct cpu_dbs_info *cdbs = (struct cpu_dbs_info *)data;
+	struct cpu_dbs_info *cdbs = container_of(data, struct cpu_dbs_info, update_util);
 	struct cpu_common_dbs_info *shared = cdbs->shared;
 
 	/*
-	 * Timer handler may not be allowed to queue the work at the moment,
-	 * because:
-	 * - Another timer handler has done that
-	 * - We are stopping the governor
-	 * - Or we are updating the sampling rate of the ondemand governor
+	 * The work may not be allowed to be queued up right now.
+	 * Possible reasons:
+	 * - Work has already been queued up or is in progress.
+	 * - The governor is being stopped.
+	 * - It is too early (too little time from the previous sample).
 	 */
-	if (atomic_inc_return(&shared->skip_work) > 1)
-		atomic_dec(&shared->skip_work);
-	else
-		queue_work(system_wq, &shared->work);
+	if (atomic_inc_return(&shared->skip_work) == 1) {
+		u64 delta_ns;
+
+		delta_ns = time - cdbs->last_sample_time;
+		if ((s64)delta_ns >= shared->sample_delay_ns) {
+			cdbs->last_sample_time = time;
+			irq_work_queue_on(&shared->irq_work, smp_processor_id());
+			return;
+		}
+	}
+	atomic_dec(&shared->skip_work);
 }
 
 static void set_sampling_rate(struct dbs_data *dbs_data,
@@ -462,9 +454,6 @@ static int cpufreq_governor_start(struct
 		io_busy = od_tuners->io_is_busy;
 	}
 
-	shared->policy = policy;
-	shared->time_stamp = ktime_get();
-
 	for_each_cpu(j, policy->cpus) {
 		struct cpu_dbs_info *j_cdbs = cdata->get_cpu_cdbs(j);
 		unsigned int prev_load;
@@ -480,10 +469,10 @@ static int cpufreq_governor_start(struct
 		if (ignore_nice)
 			j_cdbs->prev_cpu_nice = kcpustat_cpu(j).cpustat[CPUTIME_NICE];
 
-		__setup_timer(&j_cdbs->timer, dbs_timer_handler,
-			      (unsigned long)j_cdbs,
-			      TIMER_DEFERRABLE | TIMER_IRQSAFE);
+		j_cdbs->update_util.func = dbs_update_util_handler;
 	}
+	shared->policy = policy;
+	init_irq_work(&shared->irq_work, dbs_irq_work);
 
 	if (cdata->governor == GOV_CONSERVATIVE) {
 		struct cs_cpu_dbs_info_s *cs_dbs_info =
@@ -500,7 +489,7 @@ static int cpufreq_governor_start(struct
 		od_ops->powersave_bias_init_cpu(cpu);
 	}
 
-	gov_add_timers(policy, delay_for_sampling_rate(sampling_rate));
+	gov_set_update_util(shared, sampling_rate);
 	return 0;
 }
 
Index: linux-pm/drivers/cpufreq/cpufreq_ondemand.c
===================================================================
--- linux-pm.orig/drivers/cpufreq/cpufreq_ondemand.c
+++ linux-pm/drivers/cpufreq/cpufreq_ondemand.c
@@ -191,7 +191,7 @@ static void od_check_cpu(int cpu, unsign
 	}
 }
 
-static unsigned int od_dbs_timer(struct cpufreq_policy *policy, bool modify_all)
+static unsigned int od_dbs_timer(struct cpufreq_policy *policy)
 {
 	struct dbs_data *dbs_data = policy->governor_data;
 	unsigned int cpu = policy->cpu;
@@ -200,9 +200,6 @@ static unsigned int od_dbs_timer(struct
 	struct od_dbs_tuners *od_tuners = dbs_data->tuners;
 	int delay = 0, sample_type = dbs_info->sample_type;
 
-	if (!modify_all)
-		goto max_delay;
-
 	/* Common NORMAL_SAMPLE setup */
 	dbs_info->sample_type = OD_NORMAL_SAMPLE;
 	if (sample_type == OD_SUB_SAMPLE) {
@@ -218,7 +215,6 @@ static unsigned int od_dbs_timer(struct
 		}
 	}
 
-max_delay:
 	if (!delay)
 		delay = delay_for_sampling_rate(od_tuners->sampling_rate
 				* dbs_info->rate_mult);
@@ -264,7 +260,7 @@ static void update_sampling_rate(struct
 		struct od_cpu_dbs_info_s *dbs_info;
 		struct cpu_dbs_info *cdbs;
 		struct cpu_common_dbs_info *shared;
-		unsigned long next_sampling, appointed_at;
+		ktime_t next_sampling, appointed_at;
 
 		dbs_info = &per_cpu(od_cpu_dbs_info, cpu);
 		cdbs = &dbs_info->cdbs;
@@ -292,16 +288,19 @@ static void update_sampling_rate(struct
 			continue;
 
 		/*
-		 * Checking this for any CPU should be fine, timers for all of
-		 * them are scheduled together.
+		 * Checking this for any CPU sharing the policy should be fine,
+		 * they are all scheduled to sample at the same time.
 		 */
-		next_sampling = jiffies + usecs_to_jiffies(new_rate);
-		appointed_at = dbs_info->cdbs.timer.expires;
+		next_sampling = ktime_add_us(ktime_get(), new_rate);
 
-		if (time_before(next_sampling, appointed_at)) {
-			gov_cancel_work(shared);
-			gov_add_timers(policy, usecs_to_jiffies(new_rate));
+		mutex_lock(&shared->timer_mutex);
+		appointed_at = ktime_add_ns(shared->time_stamp,
+					    shared->sample_delay_ns);
+		mutex_unlock(&shared->timer_mutex);
 
+		if (ktime_before(next_sampling, appointed_at)) {
+			gov_cancel_work(shared);
+			gov_set_update_util(shared, new_rate);
 		}
 	}
 
Index: linux-pm/drivers/cpufreq/cpufreq_conservative.c
===================================================================
--- linux-pm.orig/drivers/cpufreq/cpufreq_conservative.c
+++ linux-pm/drivers/cpufreq/cpufreq_conservative.c
@@ -115,14 +115,12 @@ static void cs_check_cpu(int cpu, unsign
 	}
 }
 
-static unsigned int cs_dbs_timer(struct cpufreq_policy *policy, bool modify_all)
+static unsigned int cs_dbs_timer(struct cpufreq_policy *policy)
 {
 	struct dbs_data *dbs_data = policy->governor_data;
 	struct cs_dbs_tuners *cs_tuners = dbs_data->tuners;
 
-	if (modify_all)
-		dbs_check_cpu(dbs_data, policy->cpu);
-
+	dbs_check_cpu(dbs_data, policy->cpu);
 	return delay_for_sampling_rate(cs_tuners->sampling_rate);
 }
 


* [Update][PATCH 3/3] cpufreq: governor: Replace timers with utilization update callbacks
  2016-01-29 22:59 ` [PATCH 3/3] cpufreq: governor: " Rafael J. Wysocki
@ 2016-02-03  1:16   ` Rafael J. Wysocki
  2016-02-04  4:49     ` Viresh Kumar
  2016-02-05  1:28     ` [PATCH 3/3 v3] " Rafael J. Wysocki
  0 siblings, 2 replies; 134+ messages in thread
From: Rafael J. Wysocki @ 2016-02-03  1:16 UTC (permalink / raw)
  To: Linux PM list
  Cc: Linux Kernel Mailing List, Peter Zijlstra, Srinivas Pandruvada,
	Viresh Kumar, Juri Lelli, Steve Muckle, Thomas Gleixner

From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Subject: [PATCH] cpufreq: governor: Replace timers with utilization update callbacks

Instead of using a per-CPU deferrable timer for queuing up governor
work items, register a utilization update callback that will be
invoked from the scheduler on utilization changes.

The sampling rate is still the same as what was used for the
deferrable timers and the added irq_work overhead should be offset by
the eliminated timers overhead, so in theory the functional impact of
this patch should not be significant.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
---

I realized that the previous version of this patch didn't remove some code
that wasn't necessary any more, so here's an update.

---
 drivers/cpufreq/cpufreq_conservative.c |    6 -
 drivers/cpufreq/cpufreq_governor.c     |  136 ++++++++++++++-------------------
 drivers/cpufreq/cpufreq_governor.h     |   13 +--
 drivers/cpufreq/cpufreq_ondemand.c     |   25 ++----
 4 files changed, 83 insertions(+), 97 deletions(-)

Index: linux-pm/drivers/cpufreq/cpufreq_governor.h
===================================================================
--- linux-pm.orig/drivers/cpufreq/cpufreq_governor.h
+++ linux-pm/drivers/cpufreq/cpufreq_governor.h
@@ -18,6 +18,7 @@
 #define _CPUFREQ_GOVERNOR_H
 
 #include <linux/atomic.h>
+#include <linux/irq_work.h>
 #include <linux/cpufreq.h>
 #include <linux/kernel_stat.h>
 #include <linux/module.h>
@@ -139,7 +140,9 @@ struct cpu_common_dbs_info {
 	struct mutex timer_mutex;
 
 	ktime_t time_stamp;
+	s64 sample_delay_ns;
 	atomic_t skip_work;
+	struct irq_work irq_work;
 	struct work_struct work;
 };
 
@@ -155,7 +158,8 @@ struct cpu_dbs_info {
 	 * wake-up from idle.
 	 */
 	unsigned int prev_load;
-	struct timer_list timer;
+	u64 last_sample_time;
+	struct update_util_data update_util;
 	struct cpu_common_dbs_info *shared;
 };
 
@@ -212,8 +216,7 @@ struct common_dbs_data {
 
 	struct cpu_dbs_info *(*get_cpu_cdbs)(int cpu);
 	void *(*get_cpu_dbs_info_s)(int cpu);
-	unsigned int (*gov_dbs_timer)(struct cpufreq_policy *policy,
-				      bool modify_all);
+	unsigned int (*gov_dbs_timer)(struct cpufreq_policy *policy);
 	void (*gov_check_cpu)(int cpu, unsigned int load);
 	int (*init)(struct dbs_data *dbs_data, bool notify);
 	void (*exit)(struct dbs_data *dbs_data, bool notify);
@@ -270,8 +273,8 @@ static ssize_t show_sampling_rate_min_go
 }
 
 extern struct mutex cpufreq_governor_lock;
-
-void gov_add_timers(struct cpufreq_policy *policy, unsigned int delay);
+void gov_set_update_util(struct cpu_common_dbs_info *shared,
+			 unsigned int delay_us);
 void gov_cancel_work(struct cpu_common_dbs_info *shared);
 void dbs_check_cpu(struct dbs_data *dbs_data, int cpu);
 int cpufreq_governor_dbs(struct cpufreq_policy *policy,
Index: linux-pm/drivers/cpufreq/cpufreq_governor.c
===================================================================
--- linux-pm.orig/drivers/cpufreq/cpufreq_governor.c
+++ linux-pm/drivers/cpufreq/cpufreq_governor.c
@@ -128,10 +128,10 @@ void dbs_check_cpu(struct dbs_data *dbs_
 		 * dropped down. So we perform the copy only once, upon the
 		 * first wake-up from idle.)
 		 *
-		 * Detecting this situation is easy: the governor's deferrable
-		 * timer would not have fired during CPU-idle periods. Hence
-		 * an unusually large 'wall_time' (as compared to the sampling
-		 * rate) indicates this scenario.
+		 * Detecting this situation is easy: the governor's utilization
+		 * update handler would not have run during CPU-idle periods.
+		 * Hence, an unusually large 'wall_time' (as compared to the
+		 * sampling rate) indicates this scenario.
 		 *
 		 * prev_load can be zero in two cases and we must recalculate it
 		 * for both cases:
@@ -161,73 +161,50 @@ void dbs_check_cpu(struct dbs_data *dbs_
 }
 EXPORT_SYMBOL_GPL(dbs_check_cpu);
 
-void gov_add_timers(struct cpufreq_policy *policy, unsigned int delay)
+void gov_set_update_util(struct cpu_common_dbs_info *shared,
+			 unsigned int delay_us)
 {
+	struct cpufreq_policy *policy = shared->policy;
 	struct dbs_data *dbs_data = policy->governor_data;
-	struct cpu_dbs_info *cdbs;
 	int cpu;
 
+	shared->sample_delay_ns = delay_us * NSEC_PER_USEC;
+	shared->time_stamp = ktime_get();
+
 	for_each_cpu(cpu, policy->cpus) {
-		cdbs = dbs_data->cdata->get_cpu_cdbs(cpu);
-		cdbs->timer.expires = jiffies + delay;
-		add_timer_on(&cdbs->timer, cpu);
+		struct cpu_dbs_info *cdbs = dbs_data->cdata->get_cpu_cdbs(cpu);
+
+		cdbs->last_sample_time = 0;
+		cpufreq_set_update_util_data(cpu, &cdbs->update_util);
 	}
 }
-EXPORT_SYMBOL_GPL(gov_add_timers);
+EXPORT_SYMBOL_GPL(gov_set_update_util);
 
-static inline void gov_cancel_timers(struct cpufreq_policy *policy)
+static inline void gov_clear_update_util(struct cpufreq_policy *policy)
 {
-	struct dbs_data *dbs_data = policy->governor_data;
-	struct cpu_dbs_info *cdbs;
 	int i;
 
-	for_each_cpu(i, policy->cpus) {
-		cdbs = dbs_data->cdata->get_cpu_cdbs(i);
-		del_timer_sync(&cdbs->timer);
-	}
+	for_each_cpu(i, policy->cpus)
+		cpufreq_set_update_util_data(i, NULL);
+
+	synchronize_rcu();
 }
 
 void gov_cancel_work(struct cpu_common_dbs_info *shared)
 {
-	/* Tell dbs_timer_handler() to skip queuing up work items. */
+	/* Tell dbs_update_util_handler() to skip queuing up work items. */
 	atomic_inc(&shared->skip_work);
 	/*
-	 * If dbs_timer_handler() is already running, it may not notice the
-	 * incremented skip_work, so wait for it to complete to prevent its work
-	 * item from being queued up after the cancel_work_sync() below.
-	 */
-	gov_cancel_timers(shared->policy);
-	/*
-	 * In case dbs_timer_handler() managed to run and spawn a work item
-	 * before the timers have been canceled, wait for that work item to
-	 * complete and then cancel all of the timers set up by it.  If
-	 * dbs_timer_handler() runs again at that point, it will see the
-	 * positive value of skip_work and won't spawn any more work items.
+	 * If dbs_update_util_handler() is already running, it may not notice
+	 * the incremented skip_work, so wait for it to complete to prevent its
+	 * work item from being queued up after the cancel_work_sync() below.
 	 */
+	gov_clear_update_util(shared->policy);
 	cancel_work_sync(&shared->work);
-	gov_cancel_timers(shared->policy);
 	atomic_set(&shared->skip_work, 0);
 }
 EXPORT_SYMBOL_GPL(gov_cancel_work);
 
-/* Will return if we need to evaluate cpu load again or not */
-static bool need_load_eval(struct cpu_common_dbs_info *shared,
-			   unsigned int sampling_rate)
-{
-	if (policy_is_shared(shared->policy)) {
-		ktime_t time_now = ktime_get();
-		s64 delta_us = ktime_us_delta(time_now, shared->time_stamp);
-
-		/* Do nothing if we recently have sampled */
-		if (delta_us < (s64)(sampling_rate / 2))
-			return false;
-		else
-			shared->time_stamp = time_now;
-	}
-
-	return true;
-}
-
 static void dbs_work_handler(struct work_struct *work)
 {
 	struct cpu_common_dbs_info *shared = container_of(work, struct
@@ -235,14 +212,10 @@ static void dbs_work_handler(struct work
 	struct cpufreq_policy *policy;
 	struct dbs_data *dbs_data;
 	unsigned int sampling_rate, delay;
-	bool eval_load;
 
 	policy = shared->policy;
 	dbs_data = policy->governor_data;
 
-	/* Kill all timers */
-	gov_cancel_timers(policy);
-
 	if (dbs_data->cdata->governor == GOV_CONSERVATIVE) {
 		struct cs_dbs_tuners *cs_tuners = dbs_data->tuners;
 
@@ -253,37 +226,53 @@ static void dbs_work_handler(struct work
 		sampling_rate = od_tuners->sampling_rate;
 	}
 
-	eval_load = need_load_eval(shared, sampling_rate);
-
 	/*
-	 * Make sure cpufreq_governor_limits() isn't evaluating load in
+	 * Make sure cpufreq_governor_limits() isn't evaluating load or the
+	 * ondemand governor isn't reading the time stamp and sampling rate in
 	 * parallel.
 	 */
 	mutex_lock(&shared->timer_mutex);
-	delay = dbs_data->cdata->gov_dbs_timer(policy, eval_load);
+	delay = dbs_data->cdata->gov_dbs_timer(policy);
+	shared->sample_delay_ns = jiffies_to_nsecs(delay);
+	shared->time_stamp = ktime_get();
 	mutex_unlock(&shared->timer_mutex);
 
+	smp_mb__before_atomic();
 	atomic_dec(&shared->skip_work);
+}
 
-	gov_add_timers(policy, delay);
+static void dbs_irq_work(struct irq_work *irq_work)
+{
+	struct cpu_common_dbs_info *shared;
+
+	shared = container_of(irq_work, struct cpu_common_dbs_info, irq_work);
+	schedule_work(&shared->work);
 }
 
-static void dbs_timer_handler(unsigned long data)
+static void dbs_update_util_handler(struct update_util_data *data, u64 time,
+				    unsigned long util, unsigned long max)
 {
-	struct cpu_dbs_info *cdbs = (struct cpu_dbs_info *)data;
+	struct cpu_dbs_info *cdbs = container_of(data, struct cpu_dbs_info, update_util);
 	struct cpu_common_dbs_info *shared = cdbs->shared;
 
 	/*
-	 * Timer handler may not be allowed to queue the work at the moment,
-	 * because:
-	 * - Another timer handler has done that
-	 * - We are stopping the governor
-	 * - Or we are updating the sampling rate of the ondemand governor
+	 * The work may not be allowed to be queued up right now.
+	 * Possible reasons:
+	 * - Work has already been queued up or is in progress.
+	 * - The governor is being stopped.
+	 * - It is too early (too little time from the previous sample).
 	 */
-	if (atomic_inc_return(&shared->skip_work) > 1)
-		atomic_dec(&shared->skip_work);
-	else
-		queue_work(system_wq, &shared->work);
+	if (atomic_inc_return(&shared->skip_work) == 1) {
+		u64 delta_ns;
+
+		delta_ns = time - cdbs->last_sample_time;
+		if ((s64)delta_ns >= shared->sample_delay_ns) {
+			cdbs->last_sample_time = time;
+			irq_work_queue_on(&shared->irq_work, smp_processor_id());
+			return;
+		}
+	}
+	atomic_dec(&shared->skip_work);
 }
 
 static void set_sampling_rate(struct dbs_data *dbs_data,
@@ -467,9 +456,6 @@ static int cpufreq_governor_start(struct
 		io_busy = od_tuners->io_is_busy;
 	}
 
-	shared->policy = policy;
-	shared->time_stamp = ktime_get();
-
 	for_each_cpu(j, policy->cpus) {
 		struct cpu_dbs_info *j_cdbs = cdata->get_cpu_cdbs(j);
 		unsigned int prev_load;
@@ -485,10 +471,10 @@ static int cpufreq_governor_start(struct
 		if (ignore_nice)
 			j_cdbs->prev_cpu_nice = kcpustat_cpu(j).cpustat[CPUTIME_NICE];
 
-		__setup_timer(&j_cdbs->timer, dbs_timer_handler,
-			      (unsigned long)j_cdbs,
-			      TIMER_DEFERRABLE | TIMER_IRQSAFE);
+		j_cdbs->update_util.func = dbs_update_util_handler;
 	}
+	shared->policy = policy;
+	init_irq_work(&shared->irq_work, dbs_irq_work);
 
 	if (cdata->governor == GOV_CONSERVATIVE) {
 		struct cs_cpu_dbs_info_s *cs_dbs_info =
@@ -505,7 +491,7 @@ static int cpufreq_governor_start(struct
 		od_ops->powersave_bias_init_cpu(cpu);
 	}
 
-	gov_add_timers(policy, delay_for_sampling_rate(sampling_rate));
+	gov_set_update_util(shared, sampling_rate);
 	return 0;
 }
 
Index: linux-pm/drivers/cpufreq/cpufreq_ondemand.c
===================================================================
--- linux-pm.orig/drivers/cpufreq/cpufreq_ondemand.c
+++ linux-pm/drivers/cpufreq/cpufreq_ondemand.c
@@ -191,7 +191,7 @@ static void od_check_cpu(int cpu, unsign
 	}
 }
 
-static unsigned int od_dbs_timer(struct cpufreq_policy *policy, bool modify_all)
+static unsigned int od_dbs_timer(struct cpufreq_policy *policy)
 {
 	struct dbs_data *dbs_data = policy->governor_data;
 	unsigned int cpu = policy->cpu;
@@ -200,9 +200,6 @@ static unsigned int od_dbs_timer(struct
 	struct od_dbs_tuners *od_tuners = dbs_data->tuners;
 	int delay = 0, sample_type = dbs_info->sample_type;
 
-	if (!modify_all)
-		goto max_delay;
-
 	/* Common NORMAL_SAMPLE setup */
 	dbs_info->sample_type = OD_NORMAL_SAMPLE;
 	if (sample_type == OD_SUB_SAMPLE) {
@@ -218,7 +215,6 @@ static unsigned int od_dbs_timer(struct
 		}
 	}
 
-max_delay:
 	if (!delay)
 		delay = delay_for_sampling_rate(od_tuners->sampling_rate
 				* dbs_info->rate_mult);
@@ -264,7 +260,7 @@ static void update_sampling_rate(struct
 		struct od_cpu_dbs_info_s *dbs_info;
 		struct cpu_dbs_info *cdbs;
 		struct cpu_common_dbs_info *shared;
-		unsigned long next_sampling, appointed_at;
+		ktime_t next_sampling, appointed_at;
 
 		dbs_info = &per_cpu(od_cpu_dbs_info, cpu);
 		cdbs = &dbs_info->cdbs;
@@ -292,16 +288,19 @@ static void update_sampling_rate(struct
 			continue;
 
 		/*
-		 * Checking this for any CPU should be fine, timers for all of
-		 * them are scheduled together.
+		 * Checking this for any CPU sharing the policy should be fine,
+		 * they are all scheduled to sample at the same time.
 		 */
-		next_sampling = jiffies + usecs_to_jiffies(new_rate);
-		appointed_at = dbs_info->cdbs.timer.expires;
+		next_sampling = ktime_add_us(ktime_get(), new_rate);
 
-		if (time_before(next_sampling, appointed_at)) {
-			gov_cancel_work(shared);
-			gov_add_timers(policy, usecs_to_jiffies(new_rate));
+		mutex_lock(&shared->timer_mutex);
+		appointed_at = ktime_add_ns(shared->time_stamp,
+					    shared->sample_delay_ns);
+		mutex_unlock(&shared->timer_mutex);
 
+		if (ktime_before(next_sampling, appointed_at)) {
+			gov_cancel_work(shared);
+			gov_set_update_util(shared, new_rate);
 		}
 	}
 
Index: linux-pm/drivers/cpufreq/cpufreq_conservative.c
===================================================================
--- linux-pm.orig/drivers/cpufreq/cpufreq_conservative.c
+++ linux-pm/drivers/cpufreq/cpufreq_conservative.c
@@ -115,14 +115,12 @@ static void cs_check_cpu(int cpu, unsign
 	}
 }
 
-static unsigned int cs_dbs_timer(struct cpufreq_policy *policy, bool modify_all)
+static unsigned int cs_dbs_timer(struct cpufreq_policy *policy)
 {
 	struct dbs_data *dbs_data = policy->governor_data;
 	struct cs_dbs_tuners *cs_tuners = dbs_data->tuners;
 
-	if (modify_all)
-		dbs_check_cpu(dbs_data, policy->cpu);
-
+	dbs_check_cpu(dbs_data, policy->cpu);
 	return delay_for_sampling_rate(cs_tuners->sampling_rate);
 }
 


* Re: [PATCH 0/3] cpufreq: Replace timers with utilization update callbacks
  2016-01-29 22:52 [PATCH 0/3] cpufreq: Replace timers with utilization update callbacks Rafael J. Wysocki
                   ` (2 preceding siblings ...)
  2016-01-29 22:59 ` [PATCH 3/3] cpufreq: governor: " Rafael J. Wysocki
@ 2016-02-03 22:20 ` Rafael J. Wysocki
  2016-02-04  0:08   ` Srinivas Pandruvada
                     ` (2 more replies)
  2016-02-10 15:17 ` [PATCH v6 " Rafael J. Wysocki
  4 siblings, 3 replies; 134+ messages in thread
From: Rafael J. Wysocki @ 2016-02-03 22:20 UTC (permalink / raw)
  To: Linux PM list
  Cc: Linux Kernel Mailing List, Peter Zijlstra, Srinivas Pandruvada,
	Viresh Kumar, Juri Lelli, Steve Muckle, Thomas Gleixner

On Friday, January 29, 2016 11:52:15 PM Rafael J. Wysocki wrote:
> Hi,
> 
> The following patch series introduces a mechanism allowing the cpufreq core
> and "setpolicy" drivers to provide utilization update callbacks to be invoked
> by the scheduler on utilization changes.  Those callbacks can be used to run
> the sampling and frequency adjustments code (intel_pstate) or to schedule the
> execution of that code in process context (cpufreq core) instead of per-CPU
> deferrable timers used in cpufreq today (which Thomas complained about during
> the last Kernel Summit).
> 
> [1/3] Introduce a mechanism for calling into cpufreq from the scheduler and
>       registering callbacks to be executed from there.
> 
> [2/3] Modify intel_pstate to use the mechanism introduced by [1/3] instead
>       of per-CPU deferrable timers to do its work.
> 
> This isn't entirely straightforward as the scheduler context running those
> callbacks is really special.  Among other things it can only use raw
> spinlocks and cannot invoke wake_up_process() directly.  Also, calling
> ktime_get() from there may be too expensive on some systems.  All that has to
> be taken into account, but even then the change allows some lines of code to be
> cut from the driver.
> 
> Some performance and energy consumption measurements have been carried out with
> an earlier version of this patch and it looks like the changes lead to a
> slightly better performing system that consumes slightly less energy at the
> same time overall.
> 
> [3/3] Modify the cpufreq core to use the mechanism introduced by [1/3] instead
>       of per-CPU deferrable timers to queue up the execution of governor work.
> 
> Again, this isn't really straightforward for the above reasons, but still the
> code size is reduced a bit by the changes.
> 
> I'm still unsure about the energy consumption and performance impact of [3/3]
> as earlier versions of it led to inconsistent results (most likely due to bugs
> in them that hopefully have been fixed in this version).  In particular, the
> additional irq_work may turn out to be problematic, but more optimizations are
> possible on top of this one even if it makes things worse by itself.
> 
> For example, it should be possible to move the execution of state selection
> code into the utilization update callback itself, at least in principle, for
> all governors.  The P-state/OPP adjustment may need to be run from process
> context still, but for the drivers that can do it without sleeping it should
> be possible to move that into the utilization update callback as well.
> 
> The patches are on top of 4.5-rc1 and have been tested on a couple of x86
> machines.

Well, no responses here, so I'm inclined to believe that this series is fine
by everybody (at least by everybody in the CC).

I can wait for a few days more, but new material is starting to pile up on top
of these patches and I'll simply need to move forward at one point.

Thanks,
Rafael


* Re: [PATCH 0/3] cpufreq: Replace timers with utilization update callbacks
  2016-02-03 22:20 ` [PATCH 0/3] cpufreq: " Rafael J. Wysocki
@ 2016-02-04  0:08   ` Srinivas Pandruvada
  2016-02-04 17:16     ` Rafael J. Wysocki
  2016-02-04 10:51   ` Juri Lelli
  2016-02-08 23:06   ` Rafael J. Wysocki
  2 siblings, 1 reply; 134+ messages in thread
From: Srinivas Pandruvada @ 2016-02-04  0:08 UTC (permalink / raw)
  To: Rafael J. Wysocki, Linux PM list
  Cc: Linux Kernel Mailing List, Peter Zijlstra, Viresh Kumar,
	Juri Lelli, Steve Muckle, Thomas Gleixner



On 02/03/2016 02:20 PM, Rafael J. Wysocki wrote:
> On Friday, January 29, 2016 11:52:15 PM Rafael J. Wysocki wrote:
>> Hi,
>>
>> The following patch series introduces a mechanism allowing the cpufreq core
>> and "setpolicy" drivers to provide utilization update callbacks to be invoked
>> by the scheduler on utilization changes.  Those callbacks can be used to run
>> the sampling and frequency adjustments code (intel_pstate) or to schedule the
>> execution of that code in process context (cpufreq core) instead of per-CPU
>> deferrable timers used in cpufreq today (which Thomas complained about during
>> the last Kernel Summit).
>>
>> [1/3] Introduce a mechanism for calling into cpufreq from the scheduler and
>>        registering callbacks to be executed from there.
>>
>> [2/3] Modify intel_pstate to use the mechanism introduced by [1/3] instead
>>        of per-CPU deferrable timers to do its work.
>>
>> This isn't entirely straightforward as the scheduler context running those
>> callbacks is really special.  Among other things it can only use raw
>> spinlocks and cannot invoke wake_up_process() directly.  Also, calling
>> ktime_get() from there may be too expensive on some systems.  All that has to
>> be taken into account, but even then the change allows some lines of code to be
>> cut from the driver.
>>
>> Some performance and energy consumption measurements have been carried out with
>> an earlier version of this patch and it looks like the changes lead to a
>> slightly better performing system that consumes slightly less energy at the
>> same time overall.
>>
>> [3/3] Modify the cpufreq core to use the mechanism introduced by [1/3] instead
>>        of per-CPU deferrable timers to queue up the execution of governor work.
>>
>> Again, this isn't really straightforward for the above reasons, but still the
>> code size is reduced a bit by the changes.
>>
>> I'm still unsure about the energy consumption and performance impact of [3/3]
>> as earlier versions of it led to inconsistent results (most likely due to bugs
>> in them that hopefully have been fixed in this version).  In particular, the
>> additional irq_work may turn out to be problematic, but more optimizations are
>> possible on top of this one even if it makes things worse by itself.
>>
>> For example, it should be possible to move the execution of state selection
>> code into the utilization update callback itself, at least in principle, for
>> all governors.  The P-state/OPP adjustment may need to be run from process
>> context still, but for the drivers that can do it without sleeping it should
>> be possible to move that into the utilization update callback as well.
>>
>> The patches are on top of 4.5-rc1 and have been tested on a couple of x86
>> machines.
> Well, no responses here, so I'm inclined to believe that this series is fine
> by everybody (at least by everybody in the CC).
>
> I can wait for a few days more, but new material is starting to pile up on top
> of these patches and I'll simply need to move forward at one point.
Based on the test results for intel_pstate and acpi_cpufreq, I don't see 
any problem in applying these patches.

Thanks,
Srinivas
> Thanks,
> Rafael
>


* Re: [PATCH 1/3] cpufreq: Add a mechanism for registering utilization update callbacks
  2016-01-29 22:53 ` [PATCH 1/3] cpufreq: Add a mechanism for registering " Rafael J. Wysocki
@ 2016-02-04  3:31   ` Viresh Kumar
  0 siblings, 0 replies; 134+ messages in thread
From: Viresh Kumar @ 2016-02-04  3:31 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Linux PM list, Linux Kernel Mailing List, Peter Zijlstra,
	Srinivas Pandruvada, Juri Lelli, Steve Muckle, Thomas Gleixner

On 29-01-16, 23:53, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> 
> Introduce a mechanism by which parts of the cpufreq subsystem
> ("setpolicy" drivers or the core) can register callbacks to be
> executed from cpufreq_update_util() which is invoked by the
> scheduler's update_load_avg() on CPU utilization changes.
> 
> This allows the "setpolicy" drivers to dispense with their timers
> and do all of the computations they need and frequency/voltage
> adjustments in the update_load_avg() code path, among other things.
> 
> The scheduler changes were suggested by Peter Zijlstra.
> 
> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> ---
>  drivers/cpufreq/cpufreq.c |   44 ++++++++++++++++++++++++++++++++++++++++++++
>  include/linux/cpufreq.h   |    7 +++++++
>  include/linux/sched.h     |    2 ++
>  kernel/sched/fair.c       |   29 ++++++++++++++++++++++++++++-
>  4 files changed, 81 insertions(+), 1 deletion(-)

Acked-by: Viresh Kumar <viresh.kumar@linaro.org>

-- 
viresh


* Re: [Update][PATCH 3/3] cpufreq: governor: Replace timers with utilization update callbacks
  2016-02-03  1:16   ` [Update][PATCH " Rafael J. Wysocki
@ 2016-02-04  4:49     ` Viresh Kumar
  2016-02-04 10:54       ` Rafael J. Wysocki
  2016-02-05  1:28     ` [PATCH 3/3 v3] " Rafael J. Wysocki
  1 sibling, 1 reply; 134+ messages in thread
From: Viresh Kumar @ 2016-02-04  4:49 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Linux PM list, Linux Kernel Mailing List, Peter Zijlstra,
	Srinivas Pandruvada, Juri Lelli, Steve Muckle, Thomas Gleixner

On 03-02-16, 02:16, Rafael J. Wysocki wrote:
> Index: linux-pm/drivers/cpufreq/cpufreq_governor.c
> -void gov_add_timers(struct cpufreq_policy *policy, unsigned int delay)
> +void gov_set_update_util(struct cpu_common_dbs_info *shared,
> +			 unsigned int delay_us)
>  {
> +	struct cpufreq_policy *policy = shared->policy;
>  	struct dbs_data *dbs_data = policy->governor_data;
> -	struct cpu_dbs_info *cdbs;
>  	int cpu;
>  
> +	shared->sample_delay_ns = delay_us * NSEC_PER_USEC;
> +	shared->time_stamp = ktime_get();
> +
>  	for_each_cpu(cpu, policy->cpus) {
> -		cdbs = dbs_data->cdata->get_cpu_cdbs(cpu);
> -		cdbs->timer.expires = jiffies + delay;
> -		add_timer_on(&cdbs->timer, cpu);
> +		struct cpu_dbs_info *cdbs = dbs_data->cdata->get_cpu_cdbs(cpu);
> +
> +		cdbs->last_sample_time = 0;
> +		cpufreq_set_update_util_data(cpu, &cdbs->update_util);

Why no synchronize_rcu() here? This can be called from the ondemand
governor on sampling-rate updates ..

>  	}
>  }
> -EXPORT_SYMBOL_GPL(gov_add_timers);
> +EXPORT_SYMBOL_GPL(gov_set_update_util);
>  
> -static inline void gov_cancel_timers(struct cpufreq_policy *policy)
> +static inline void gov_clear_update_util(struct cpufreq_policy *policy)
>  {
> -	struct dbs_data *dbs_data = policy->governor_data;
> -	struct cpu_dbs_info *cdbs;
>  	int i;
>  
> -	for_each_cpu(i, policy->cpus) {
> -		cdbs = dbs_data->cdata->get_cpu_cdbs(i);
> -		del_timer_sync(&cdbs->timer);
> -	}
> +	for_each_cpu(i, policy->cpus)
> +		cpufreq_set_update_util_data(i, NULL);
> +
> +	synchronize_rcu();
>  }
>  
>  void gov_cancel_work(struct cpu_common_dbs_info *shared)
>  {
> -	/* Tell dbs_timer_handler() to skip queuing up work items. */
> +	/* Tell dbs_update_util_handler() to skip queuing up work items. */
>  	atomic_inc(&shared->skip_work);
>  	/*
> -	 * If dbs_timer_handler() is already running, it may not notice the
> -	 * incremented skip_work, so wait for it to complete to prevent its work
> -	 * item from being queued up after the cancel_work_sync() below.
> -	 */
> -	gov_cancel_timers(shared->policy);
> -	/*
> -	 * In case dbs_timer_handler() managed to run and spawn a work item
> -	 * before the timers have been canceled, wait for that work item to
> -	 * complete and then cancel all of the timers set up by it.  If
> -	 * dbs_timer_handler() runs again at that point, it will see the
> -	 * positive value of skip_work and won't spawn any more work items.
> +	 * If dbs_update_util_handler() is already running, it may not notice
> +	 * the incremented skip_work, so wait for it to complete to prevent its
> +	 * work item from being queued up after the cancel_work_sync() below.
>  	 */
> +	gov_clear_update_util(shared->policy);
>  	cancel_work_sync(&shared->work);

How are we sure that the irq-work can't be pending at this point,
which would queue the above work again ?

> -	gov_cancel_timers(shared->policy);
>  	atomic_set(&shared->skip_work, 0);
>  }
>  EXPORT_SYMBOL_GPL(gov_cancel_work);
>  
> -/* Will return if we need to evaluate cpu load again or not */
> -static bool need_load_eval(struct cpu_common_dbs_info *shared,
> -			   unsigned int sampling_rate)
> -{
> -	if (policy_is_shared(shared->policy)) {
> -		ktime_t time_now = ktime_get();
> -		s64 delta_us = ktime_us_delta(time_now, shared->time_stamp);
> -
> -		/* Do nothing if we recently have sampled */
> -		if (delta_us < (s64)(sampling_rate / 2))
> -			return false;
> -		else
> -			shared->time_stamp = time_now;
> -	}
> -
> -	return true;
> -}
> -
>  static void dbs_work_handler(struct work_struct *work)
>  {
>  	struct cpu_common_dbs_info *shared = container_of(work, struct
> @@ -235,14 +212,10 @@ static void dbs_work_handler(struct work
>  	struct cpufreq_policy *policy;
>  	struct dbs_data *dbs_data;
>  	unsigned int sampling_rate, delay;
> -	bool eval_load;
>  
>  	policy = shared->policy;
>  	dbs_data = policy->governor_data;
>  
> -	/* Kill all timers */
> -	gov_cancel_timers(policy);
> -
>  	if (dbs_data->cdata->governor == GOV_CONSERVATIVE) {
>  		struct cs_dbs_tuners *cs_tuners = dbs_data->tuners;
>  
> @@ -253,37 +226,53 @@ static void dbs_work_handler(struct work
>  		sampling_rate = od_tuners->sampling_rate;
>  	}
>  
> -	eval_load = need_load_eval(shared, sampling_rate);
> -
>  	/*
> -	 * Make sure cpufreq_governor_limits() isn't evaluating load in
> +	 * Make sure cpufreq_governor_limits() isn't evaluating load or the
> +	 * ondemand governor isn't reading the time stamp and sampling rate in
>  	 * parallel.
>  	 */
>  	mutex_lock(&shared->timer_mutex);
> -	delay = dbs_data->cdata->gov_dbs_timer(policy, eval_load);
> +	delay = dbs_data->cdata->gov_dbs_timer(policy);
> +	shared->sample_delay_ns = jiffies_to_nsecs(delay);
> +	shared->time_stamp = ktime_get();
>  	mutex_unlock(&shared->timer_mutex);
>  
> +	smp_mb__before_atomic();

And why is this required exactly ? Maybe add a comment as well to clarify
this, as it isn't obvious ?

>  	atomic_dec(&shared->skip_work);
> +}
>  
> -	gov_add_timers(policy, delay);
> +static void dbs_irq_work(struct irq_work *irq_work)
> +{
> +	struct cpu_common_dbs_info *shared;
> +
> +	shared = container_of(irq_work, struct cpu_common_dbs_info, irq_work);
> +	schedule_work(&shared->work);
>  }
>  
> -static void dbs_timer_handler(unsigned long data)
> +static void dbs_update_util_handler(struct update_util_data *data, u64 time,
> +				    unsigned long util, unsigned long max)
>  {
> -	struct cpu_dbs_info *cdbs = (struct cpu_dbs_info *)data;
> +	struct cpu_dbs_info *cdbs = container_of(data, struct cpu_dbs_info, update_util);
>  	struct cpu_common_dbs_info *shared = cdbs->shared;
>  
>  	/*
> -	 * Timer handler may not be allowed to queue the work at the moment,
> -	 * because:
> -	 * - Another timer handler has done that
> -	 * - We are stopping the governor
> -	 * - Or we are updating the sampling rate of the ondemand governor
> +	 * The work may not be allowed to be queued up right now.
> +	 * Possible reasons:
> +	 * - Work has already been queued up or is in progress.
> +	 * - The governor is being stopped.
> +	 * - It is too early (too little time from the previous sample).
>  	 */
> -	if (atomic_inc_return(&shared->skip_work) > 1)
> -		atomic_dec(&shared->skip_work);
> -	else
> -		queue_work(system_wq, &shared->work);
> +	if (atomic_inc_return(&shared->skip_work) == 1) {
> +		u64 delta_ns;
> +
> +		delta_ns = time - cdbs->last_sample_time;
> +		if ((s64)delta_ns >= shared->sample_delay_ns) {
> +			cdbs->last_sample_time = time;
> +			irq_work_queue_on(&shared->irq_work, smp_processor_id());
> +			return;
> +		}
> +	}
> +	atomic_dec(&shared->skip_work);
>  }
>  
>  static void set_sampling_rate(struct dbs_data *dbs_data,
> @@ -467,9 +456,6 @@ static int cpufreq_governor_start(struct
>  		io_busy = od_tuners->io_is_busy;
>  	}
>  
> -	shared->policy = policy;
> -	shared->time_stamp = ktime_get();
> -
>  	for_each_cpu(j, policy->cpus) {
>  		struct cpu_dbs_info *j_cdbs = cdata->get_cpu_cdbs(j);
>  		unsigned int prev_load;
> @@ -485,10 +471,10 @@ static int cpufreq_governor_start(struct
>  		if (ignore_nice)
>  			j_cdbs->prev_cpu_nice = kcpustat_cpu(j).cpustat[CPUTIME_NICE];
>  
> -		__setup_timer(&j_cdbs->timer, dbs_timer_handler,
> -			      (unsigned long)j_cdbs,
> -			      TIMER_DEFERRABLE | TIMER_IRQSAFE);
> +		j_cdbs->update_util.func = dbs_update_util_handler;
>  	}
> +	shared->policy = policy;
> +	init_irq_work(&shared->irq_work, dbs_irq_work);
>  
>  	if (cdata->governor == GOV_CONSERVATIVE) {
>  		struct cs_cpu_dbs_info_s *cs_dbs_info =
> @@ -505,7 +491,7 @@ static int cpufreq_governor_start(struct
>  		od_ops->powersave_bias_init_cpu(cpu);
>  	}
>  
> -	gov_add_timers(policy, delay_for_sampling_rate(sampling_rate));
> +	gov_set_update_util(shared, sampling_rate);
>  	return 0;
>  }
>  
> Index: linux-pm/drivers/cpufreq/cpufreq_ondemand.c
> ===================================================================
> --- linux-pm.orig/drivers/cpufreq/cpufreq_ondemand.c
> +++ linux-pm/drivers/cpufreq/cpufreq_ondemand.c
> @@ -191,7 +191,7 @@ static void od_check_cpu(int cpu, unsign
>  	}
>  }
>  
> -static unsigned int od_dbs_timer(struct cpufreq_policy *policy, bool modify_all)
> +static unsigned int od_dbs_timer(struct cpufreq_policy *policy)
>  {
>  	struct dbs_data *dbs_data = policy->governor_data;
>  	unsigned int cpu = policy->cpu;
> @@ -200,9 +200,6 @@ static unsigned int od_dbs_timer(struct
>  	struct od_dbs_tuners *od_tuners = dbs_data->tuners;
>  	int delay = 0, sample_type = dbs_info->sample_type;

Perhaps, the delay = 0 can be dropped now and ...

>  
> -	if (!modify_all)
> -		goto max_delay;
> -
>  	/* Common NORMAL_SAMPLE setup */
>  	dbs_info->sample_type = OD_NORMAL_SAMPLE;
>  	if (sample_type == OD_SUB_SAMPLE) {
> @@ -218,7 +215,6 @@ static unsigned int od_dbs_timer(struct
>  		}
>  	}
>  
> -max_delay:
>  	if (!delay)
>  		delay = delay_for_sampling_rate(od_tuners->sampling_rate
>  				* dbs_info->rate_mult);

^^ can be moved to the else part of the above block ..

> @@ -264,7 +260,7 @@ static void update_sampling_rate(struct
>  		struct od_cpu_dbs_info_s *dbs_info;
>  		struct cpu_dbs_info *cdbs;
>  		struct cpu_common_dbs_info *shared;
> -		unsigned long next_sampling, appointed_at;
> +		ktime_t next_sampling, appointed_at;
>  
>  		dbs_info = &per_cpu(od_cpu_dbs_info, cpu);
>  		cdbs = &dbs_info->cdbs;
> @@ -292,16 +288,19 @@ static void update_sampling_rate(struct
>  			continue;
>  
>  		/*
> -		 * Checking this for any CPU should be fine, timers for all of
> -		 * them are scheduled together.
> +		 * Checking this for any CPU sharing the policy should be fine,
> +		 * they are all scheduled to sample at the same time.
>  		 */
> -		next_sampling = jiffies + usecs_to_jiffies(new_rate);
> -		appointed_at = dbs_info->cdbs.timer.expires;
> +		next_sampling = ktime_add_us(ktime_get(), new_rate);
>  
> -		if (time_before(next_sampling, appointed_at)) {
> -			gov_cancel_work(shared);
> -			gov_add_timers(policy, usecs_to_jiffies(new_rate));
> +		mutex_lock(&shared->timer_mutex);

Why is taking this lock important here ?

> +		appointed_at = ktime_add_ns(shared->time_stamp,

Also, I failed to understand why we need the time_stamp variable at all?
Why can't we use last_sample_time ?

> +					    shared->sample_delay_ns);
> +		mutex_unlock(&shared->timer_mutex);
>  
> +		if (ktime_before(next_sampling, appointed_at)) {
> +			gov_cancel_work(shared);
> +			gov_set_update_util(shared, new_rate);

You don't need to do a complete update here, the pointers are all fine.

>  		}
>  	}

-- 
viresh

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH 0/3] cpufreq: Replace timers with utilization update callbacks
  2016-02-03 22:20 ` [PATCH 0/3] cpufreq: " Rafael J. Wysocki
  2016-02-04  0:08   ` Srinivas Pandruvada
@ 2016-02-04 10:51   ` Juri Lelli
  2016-02-04 17:19     ` Rafael J. Wysocki
  2016-02-08 23:06   ` Rafael J. Wysocki
  2 siblings, 1 reply; 134+ messages in thread
From: Juri Lelli @ 2016-02-04 10:51 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Linux PM list, Linux Kernel Mailing List, Peter Zijlstra,
	Srinivas Pandruvada, Viresh Kumar, Steve Muckle, Thomas Gleixner

Hi Rafael,

On 03/02/16 23:20, Rafael J. Wysocki wrote:
> On Friday, January 29, 2016 11:52:15 PM Rafael J. Wysocki wrote:
> > Hi,
> > 
> > The following patch series introduces a mechanism allowing the cpufreq core
> > and "setpolicy" drivers to provide utilization update callbacks to be invoked
> > by the scheduler on utilization changes.  Those callbacks can be used to run
> > the sampling and frequency adjustments code (intel_pstate) or to schedule the
> > execution of that code in process context (cpufreq core) instead of per-CPU
> > deferrable timers used in cpufreq today (which Thomas complained about during
> > the last Kernel Summit).
> > 
> > [1/3] Introduce a mechanism for calling into cpufreq from the scheduler and
> >       registering callbacks to be executed from there.
> > 
> > [2/3] Modify intel_pstate to use the mechanism introduced by [1/3] instead
> >       of per-CPU deferrable timers to do its work.
> > 
> > This isn't entirely straightforward as the scheduler context running those
> > callbacks is really special.  Among other things it can only use raw
> > spinlocks and cannot invoke wake_up_process() directly.  Also, calling
> > ktime_get() from there may be too expensive on some systems.  All that has to
> > be taken into account, but even then the change allows some lines of code to be
> > cut from the driver.
> > 
> > Some performance and energy consumption measurements have been carried out with
> > an earlier version of this patch and it looks like the changes lead to a
> > slightly better performing system that consumes slightly less energy at the
> > same time overall.
> > 
> > [3/3] Modify the cpufreq core to use the mechanism introduced by [1/3] instead
> >       of per-CPU deferrable timers to queue up the execution of governor work.
> > 
> > Again, this isn't really straightforward for the above reasons, but still the
> > code size is reduced a bit by the changes.
> > 
> > I'm still unsure about the energy consumption and performance impact of [3/3]
> > as earlier versions of it led to inconsistent results (most likely due to bugs
> > in them that hopefully have been fixed in this version).  In particular, the
> > additional irq_work may turn out to be problematic, but more optimizations are
> > possible on top of this one even if it makes things worse by itself.
> > 
> > For example, it should be possible to move the execution of state selection
> > code into the utilization update callback itself, at least in principle, for
> > all governors.  The P-state/OPP adjustment may need to be run from process
> > context still, but for the drivers that can do it without sleeping it should
> > be possible to move that into the utilization update callback as well.
> > 
> > The patches are on top of 4.5-rc1 and have been tested on a couple of x86
> > machines.
> 
> Well, no responses here, so I'm inclined to believe that this series is fine
> by everybody (at least by everybody in the CC).
> 

I did intend to test and review this series, but then other patches
required attention as well and I didn't find time to have a look at
these. Sorry about that. Also, if I can speak for him, I think that
Steve is OOO this week.

> I can wait for a few days more, but new material is starting to pile up on top
> of these patches and I'll simply need to move forward at one point.
> 

Unfortunately, I can't promise anything at the moment, but, if I find
some time, I'll run some tests (BTW, do you already have something that I
can run on my boxes?). I guess I can eventually do that after
this gets merged as well.

Best,

- Juri

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [Update][PATCH 3/3] cpufreq: governor: Replace timers with utilization update callbacks
  2016-02-04  4:49     ` Viresh Kumar
@ 2016-02-04 10:54       ` Rafael J. Wysocki
  0 siblings, 0 replies; 134+ messages in thread
From: Rafael J. Wysocki @ 2016-02-04 10:54 UTC (permalink / raw)
  To: Viresh Kumar
  Cc: Linux PM list, Linux Kernel Mailing List, Peter Zijlstra,
	Srinivas Pandruvada, Juri Lelli, Steve Muckle, Thomas Gleixner

On Thursday, February 04, 2016 10:19:59 AM Viresh Kumar wrote:
> On 03-02-16, 02:16, Rafael J. Wysocki wrote:
> > Index: linux-pm/drivers/cpufreq/cpufreq_governor.c
> > -void gov_add_timers(struct cpufreq_policy *policy, unsigned int delay)
> > +void gov_set_update_util(struct cpu_common_dbs_info *shared,
> > +			 unsigned int delay_us)
> >  {
> > +	struct cpufreq_policy *policy = shared->policy;
> >  	struct dbs_data *dbs_data = policy->governor_data;
> > -	struct cpu_dbs_info *cdbs;
> >  	int cpu;
> >  
> > +	shared->sample_delay_ns = delay_us * NSEC_PER_USEC;
> > +	shared->time_stamp = ktime_get();
> > +
> >  	for_each_cpu(cpu, policy->cpus) {
> > -		cdbs = dbs_data->cdata->get_cpu_cdbs(cpu);
> > -		cdbs->timer.expires = jiffies + delay;
> > -		add_timer_on(&cdbs->timer, cpu);
> > +		struct cpu_dbs_info *cdbs = dbs_data->cdata->get_cpu_cdbs(cpu);
> > +
> > +		cdbs->last_sample_time = 0;
> > +		cpufreq_set_update_util_data(cpu, &cdbs->update_util);
> 
> Why no synchronize_rcu() here?

Because it is not needed.  This always changes a NULL pointer into a non-NULL.

> This can be called from the ondemand governor on sampling-rate updates ..

But that calls gov_cancel_work() before, right?
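
For reference, the registration helper added by [1/3] boils down to an
rcu_assign_pointer() of a per-CPU pointer (a sketch; see the patch for the
exact code):

	static DEFINE_PER_CPU(struct update_util_data *, cpufreq_update_util_data);

	void cpufreq_set_update_util_data(int cpu, struct update_util_data *data)
	{
		if (WARN_ON(data && !data->func))
			return;

		rcu_assign_pointer(per_cpu(cpufreq_update_util_data, cpu), data);
	}

The scheduler dereferences that pointer under rcu_read_lock(), so it either
sees NULL (and skips the callback) or the fully initialized new data.  Only
the non-NULL to NULL transition has to be followed by synchronize_rcu()
before anything the old pointer refers to can be freed.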

> 
> >  	}
> >  }
> > -EXPORT_SYMBOL_GPL(gov_add_timers);
> > +EXPORT_SYMBOL_GPL(gov_set_update_util);
> >  
> > -static inline void gov_cancel_timers(struct cpufreq_policy *policy)
> > +static inline void gov_clear_update_util(struct cpufreq_policy *policy)
> >  {
> > -	struct dbs_data *dbs_data = policy->governor_data;
> > -	struct cpu_dbs_info *cdbs;
> >  	int i;
> >  
> > -	for_each_cpu(i, policy->cpus) {
> > -		cdbs = dbs_data->cdata->get_cpu_cdbs(i);
> > -		del_timer_sync(&cdbs->timer);
> > -	}
> > +	for_each_cpu(i, policy->cpus)
> > +		cpufreq_set_update_util_data(i, NULL);
> > +
> > +	synchronize_rcu();
> >  }
> >  
> >  void gov_cancel_work(struct cpu_common_dbs_info *shared)
> >  {
> > -	/* Tell dbs_timer_handler() to skip queuing up work items. */
> > +	/* Tell dbs_update_util_handler() to skip queuing up work items. */
> >  	atomic_inc(&shared->skip_work);
> >  	/*
> > -	 * If dbs_timer_handler() is already running, it may not notice the
> > -	 * incremented skip_work, so wait for it to complete to prevent its work
> > -	 * item from being queued up after the cancel_work_sync() below.
> > -	 */
> > -	gov_cancel_timers(shared->policy);
> > -	/*
> > -	 * In case dbs_timer_handler() managed to run and spawn a work item
> > -	 * before the timers have been canceled, wait for that work item to
> > -	 * complete and then cancel all of the timers set up by it.  If
> > -	 * dbs_timer_handler() runs again at that point, it will see the
> > -	 * positive value of skip_work and won't spawn any more work items.
> > +	 * If dbs_update_util_handler() is already running, it may not notice
> > +	 * the incremented skip_work, so wait for it to complete to prevent its
> > +	 * work item from being queued up after the cancel_work_sync() below.
> >  	 */
> > +	gov_clear_update_util(shared->policy);
> >  	cancel_work_sync(&shared->work);
> 
> How are we sure that the irq-work can't be pending at this point,
> which would queue the above work again ?

Good point.  The irq_work has to be waited for here too.

> > -	gov_cancel_timers(shared->policy);
> >  	atomic_set(&shared->skip_work, 0);
> >  }
> >  EXPORT_SYMBOL_GPL(gov_cancel_work);
> >  
> > -/* Will return if we need to evaluate cpu load again or not */
> > -static bool need_load_eval(struct cpu_common_dbs_info *shared,
> > -			   unsigned int sampling_rate)
> > -{
> > -	if (policy_is_shared(shared->policy)) {
> > -		ktime_t time_now = ktime_get();
> > -		s64 delta_us = ktime_us_delta(time_now, shared->time_stamp);
> > -
> > -		/* Do nothing if we recently have sampled */
> > -		if (delta_us < (s64)(sampling_rate / 2))
> > -			return false;
> > -		else
> > -			shared->time_stamp = time_now;
> > -	}
> > -
> > -	return true;
> > -}
> > -
> >  static void dbs_work_handler(struct work_struct *work)
> >  {
> >  	struct cpu_common_dbs_info *shared = container_of(work, struct
> > @@ -235,14 +212,10 @@ static void dbs_work_handler(struct work
> >  	struct cpufreq_policy *policy;
> >  	struct dbs_data *dbs_data;
> >  	unsigned int sampling_rate, delay;
> > -	bool eval_load;
> >  
> >  	policy = shared->policy;
> >  	dbs_data = policy->governor_data;
> >  
> > -	/* Kill all timers */
> > -	gov_cancel_timers(policy);
> > -
> >  	if (dbs_data->cdata->governor == GOV_CONSERVATIVE) {
> >  		struct cs_dbs_tuners *cs_tuners = dbs_data->tuners;
> >  
> > @@ -253,37 +226,53 @@ static void dbs_work_handler(struct work
> >  		sampling_rate = od_tuners->sampling_rate;
> >  	}
> >  
> > -	eval_load = need_load_eval(shared, sampling_rate);
> > -
> >  	/*
> > -	 * Make sure cpufreq_governor_limits() isn't evaluating load in
> > +	 * Make sure cpufreq_governor_limits() isn't evaluating load or the
> > +	 * ondemand governor isn't reading the time stamp and sampling rate in
> >  	 * parallel.
> >  	 */
> >  	mutex_lock(&shared->timer_mutex);
> > -	delay = dbs_data->cdata->gov_dbs_timer(policy, eval_load);
> > +	delay = dbs_data->cdata->gov_dbs_timer(policy);
> > +	shared->sample_delay_ns = jiffies_to_nsecs(delay);
> > +	shared->time_stamp = ktime_get();
> >  	mutex_unlock(&shared->timer_mutex);
> >  
> > +	smp_mb__before_atomic();
> 
> And why is this required exactly ? Maybe add a comment as well to clarify
> this, as it isn't obvious ?

OK, you have a point.

This relies on the atomic_dec() below to happen after sample_delay_ns has
been updated, to prevent dbs_update_util_handler() from using a stale
value.
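
To illustrate the intended ordering (a minimal sketch of the pairing, not
the driver code verbatim):

	/* dbs_work_handler() side */
	shared->sample_delay_ns = jiffies_to_nsecs(delay);  /* publish new delay */
	smp_mb__before_atomic();    /* order the store above before the dec below */
	atomic_dec(&shared->skip_work);

	/* dbs_update_util_handler() side */
	if (atomic_inc_return(&shared->skip_work) == 1) {
		/*
		 * atomic_inc_return() is fully ordered, so if it observes the
		 * decremented skip_work, the updated sample_delay_ns is
		 * guaranteed to be visible here too.
		 */
	}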

> >  	atomic_dec(&shared->skip_work);
> > +}
> >  
> > -	gov_add_timers(policy, delay);
> > +static void dbs_irq_work(struct irq_work *irq_work)
> > +{
> > +	struct cpu_common_dbs_info *shared;
> > +
> > +	shared = container_of(irq_work, struct cpu_common_dbs_info, irq_work);
> > +	schedule_work(&shared->work);
> >  }
> >  
> > -static void dbs_timer_handler(unsigned long data)
> > +static void dbs_update_util_handler(struct update_util_data *data, u64 time,
> > +				    unsigned long util, unsigned long max)
> >  {
> > -	struct cpu_dbs_info *cdbs = (struct cpu_dbs_info *)data;
> > +	struct cpu_dbs_info *cdbs = container_of(data, struct cpu_dbs_info, update_util);
> >  	struct cpu_common_dbs_info *shared = cdbs->shared;
> >  
> >  	/*
> > -	 * Timer handler may not be allowed to queue the work at the moment,
> > -	 * because:
> > -	 * - Another timer handler has done that
> > -	 * - We are stopping the governor
> > -	 * - Or we are updating the sampling rate of the ondemand governor
> > +	 * The work may not be allowed to be queued up right now.
> > +	 * Possible reasons:
> > +	 * - Work has already been queued up or is in progress.
> > +	 * - The governor is being stopped.
> > +	 * - It is too early (too little time from the previous sample).
> >  	 */
> > -	if (atomic_inc_return(&shared->skip_work) > 1)
> > -		atomic_dec(&shared->skip_work);
> > -	else
> > -		queue_work(system_wq, &shared->work);
> > +	if (atomic_inc_return(&shared->skip_work) == 1) {
> > +		u64 delta_ns;
> > +
> > +		delta_ns = time - cdbs->last_sample_time;
> > +		if ((s64)delta_ns >= shared->sample_delay_ns) {
> > +			cdbs->last_sample_time = time;
> > +			irq_work_queue_on(&shared->irq_work, smp_processor_id());
> > +			return;
> > +		}
> > +	}
> > +	atomic_dec(&shared->skip_work);
> >  }
> >  
> >  static void set_sampling_rate(struct dbs_data *dbs_data,
> > @@ -467,9 +456,6 @@ static int cpufreq_governor_start(struct
> >  		io_busy = od_tuners->io_is_busy;
> >  	}
> >  
> > -	shared->policy = policy;
> > -	shared->time_stamp = ktime_get();
> > -
> >  	for_each_cpu(j, policy->cpus) {
> >  		struct cpu_dbs_info *j_cdbs = cdata->get_cpu_cdbs(j);
> >  		unsigned int prev_load;
> > @@ -485,10 +471,10 @@ static int cpufreq_governor_start(struct
> >  		if (ignore_nice)
> >  			j_cdbs->prev_cpu_nice = kcpustat_cpu(j).cpustat[CPUTIME_NICE];
> >  
> > -		__setup_timer(&j_cdbs->timer, dbs_timer_handler,
> > -			      (unsigned long)j_cdbs,
> > -			      TIMER_DEFERRABLE | TIMER_IRQSAFE);
> > +		j_cdbs->update_util.func = dbs_update_util_handler;
> >  	}
> > +	shared->policy = policy;
> > +	init_irq_work(&shared->irq_work, dbs_irq_work);
> >  
> >  	if (cdata->governor == GOV_CONSERVATIVE) {
> >  		struct cs_cpu_dbs_info_s *cs_dbs_info =
> > @@ -505,7 +491,7 @@ static int cpufreq_governor_start(struct
> >  		od_ops->powersave_bias_init_cpu(cpu);
> >  	}
> >  
> > -	gov_add_timers(policy, delay_for_sampling_rate(sampling_rate));
> > +	gov_set_update_util(shared, sampling_rate);
> >  	return 0;
> >  }
> >  
> > Index: linux-pm/drivers/cpufreq/cpufreq_ondemand.c
> > ===================================================================
> > --- linux-pm.orig/drivers/cpufreq/cpufreq_ondemand.c
> > +++ linux-pm/drivers/cpufreq/cpufreq_ondemand.c
> > @@ -191,7 +191,7 @@ static void od_check_cpu(int cpu, unsign
> >  	}
> >  }
> >  
> > -static unsigned int od_dbs_timer(struct cpufreq_policy *policy, bool modify_all)
> > +static unsigned int od_dbs_timer(struct cpufreq_policy *policy)
> >  {
> >  	struct dbs_data *dbs_data = policy->governor_data;
> >  	unsigned int cpu = policy->cpu;
> > @@ -200,9 +200,6 @@ static unsigned int od_dbs_timer(struct
> >  	struct od_dbs_tuners *od_tuners = dbs_data->tuners;
> >  	int delay = 0, sample_type = dbs_info->sample_type;
> 
> Perhaps, the delay = 0 can be dropped now and ...
> 
> >  
> > -	if (!modify_all)
> > -		goto max_delay;
> > -
> >  	/* Common NORMAL_SAMPLE setup */
> >  	dbs_info->sample_type = OD_NORMAL_SAMPLE;
> >  	if (sample_type == OD_SUB_SAMPLE) {
> > @@ -218,7 +215,6 @@ static unsigned int od_dbs_timer(struct
> >  		}
> >  	}
> >  
> > -max_delay:
> >  	if (!delay)
> >  		delay = delay_for_sampling_rate(od_tuners->sampling_rate
> >  				* dbs_info->rate_mult);
> 
> ^^ can be moved to the else part of the above block ..

Both this and the above are valid observations, but those changes should be
made in a follow-up patch IMO.

> > @@ -264,7 +260,7 @@ static void update_sampling_rate(struct
> >  		struct od_cpu_dbs_info_s *dbs_info;
> >  		struct cpu_dbs_info *cdbs;
> >  		struct cpu_common_dbs_info *shared;
> > -		unsigned long next_sampling, appointed_at;
> > +		ktime_t next_sampling, appointed_at;
> >  
> >  		dbs_info = &per_cpu(od_cpu_dbs_info, cpu);
> >  		cdbs = &dbs_info->cdbs;
> > @@ -292,16 +288,19 @@ static void update_sampling_rate(struct
> >  			continue;
> >  
> >  		/*
> > -		 * Checking this for any CPU should be fine, timers for all of
> > -		 * them are scheduled together.
> > +		 * Checking this for any CPU sharing the policy should be fine,
> > +		 * they are all scheduled to sample at the same time.
> >  		 */
> > -		next_sampling = jiffies + usecs_to_jiffies(new_rate);
> > -		appointed_at = dbs_info->cdbs.timer.expires;
> > +		next_sampling = ktime_add_us(ktime_get(), new_rate);
> >  
> > -		if (time_before(next_sampling, appointed_at)) {
> > -			gov_cancel_work(shared);
> > -			gov_add_timers(policy, usecs_to_jiffies(new_rate));
> > +		mutex_lock(&shared->timer_mutex);
> 
> Why is taking this lock important here ?

Because this reads both time_stamp and sample_delay_ns and uses them in
a computation.  If they happen to be out of sync, this surely isn't right. 

> > +		appointed_at = ktime_add_ns(shared->time_stamp,
> 
> Also, I failed to understand why we need the time_stamp variable at all?
> Why can't we use last_sample_time ?

Because last_sample_time is set from the time argument passed in by the
scheduler, whose time base may be different from that of ktime_get(), so
comparing the two may not lead to correct decisions, so to speak.

> > +					    shared->sample_delay_ns);
> > +		mutex_unlock(&shared->timer_mutex);
> >  
> > +		if (ktime_before(next_sampling, appointed_at)) {
> > +			gov_cancel_work(shared);
> > +			gov_set_update_util(shared, new_rate);
> 
> You don't need to do a complete update here, the pointers are all fine.

I do, but that's not because of the pointers.

Effectively, I need to change sample_delay_ns and that's the most
straightforward way to do that safely.

It may not be the most efficient, but this is not a fast path anyway.

> >  		}
> >  	}
> 
> 

Thanks,
Rafael

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH 0/3] cpufreq: Replace timers with utilization update callbacks
  2016-02-04  0:08   ` Srinivas Pandruvada
@ 2016-02-04 17:16     ` Rafael J. Wysocki
  0 siblings, 0 replies; 134+ messages in thread
From: Rafael J. Wysocki @ 2016-02-04 17:16 UTC (permalink / raw)
  To: Srinivas Pandruvada
  Cc: Rafael J. Wysocki, Linux PM list, Linux Kernel Mailing List,
	Peter Zijlstra, Viresh Kumar, Juri Lelli, Steve Muckle,
	Thomas Gleixner

On Thu, Feb 4, 2016 at 1:08 AM, Srinivas Pandruvada
<srinivas.pandruvada@linux.intel.com> wrote:
>
>
> On 02/03/2016 02:20 PM, Rafael J. Wysocki wrote:
>>
>> On Friday, January 29, 2016 11:52:15 PM Rafael J. Wysocki wrote:
>>>
>>> Hi,
>>>
>>> The following patch series introduces a mechanism allowing the cpufreq
>>> core
>>> and "setpolicy" drivers to provide utilization update callbacks to be
>>> invoked
>>> by the scheduler on utilization changes.  Those callbacks can be used to
>>> run
>>> the sampling and frequency adjustments code (intel_pstate) or to schedule
>>> the
>>> execution of that code in process context (cpufreq core) instead of
>>> per-CPU
>>> deferrable timers used in cpufreq today (which Thomas complained about
>>> during
>>> the last Kernel Summit).
>>>
>>> [1/3] Introduce a mechanism for calling into cpufreq from the scheduler
>>> and
>>>        registering callbacks to be executed from there.
>>>
>>> [2/3] Modify intel_pstate to use the mechanism introduced by [1/3]
>>> instead
>>>        of per-CPU deferrable timers to do its work.
>>>
>>> This isn't entirely straightforward as the scheduler context running
>>> those
>>> callbacks is really special.  Among other things it can only use raw
>>> spinlocks and cannot invoke wake_up_process() directly.  Also, calling
>>> ktime_get() from there may be too expensive on some systems.  All that
>>> has to
>>> be taken into account, but even then the change allows some lines of code
>>> to be
>>> cut from the driver.
>>>
>>> Some performance and energy consumption measurements have been carried
>>> out with
>>> an earlier version of this patch and it looks like the changes lead to a
>>> slightly better performing system that consumes slightly less energy at
>>> the
>>> same time overall.
>>>
>>> [3/3] Modify the cpufreq core to use the mechanism introduced by [1/3]
>>> instead
>>>        of per-CPU deferrable timers to queue up the execution of governor
>>> work.
>>>
>>> Again, this isn't really straightforward for the above reasons, but still
>>> the
>>> code size is reduced a bit by the changes.
>>>
>>> I'm still unsure about the energy consumption and performance impact of
>>> [3/3]
>>> as earlier versions of it led to inconsistent results (most likely due to
>>> bugs
>>> in them that hopefully have been fixed in this version).  In particular,
>>> the
>>> additional irq_work may turn out to be problematic, but more
>>> optimizations are
>>> possible on top of this one even if it makes things worse by itself.
>>>
>>> For example, it should be possible to move the execution of state
>>> selection
>>> code into the utilization update callback itself, at least in principle,
>>> for
>>> all governors.  The P-state/OPP adjustment may need to be run from
>>> process
>>> context still, but for the drivers that can do it without sleeping it
>>> should
>>> be possible to move that into the utilization update callback as well.
>>>
>>> The patches are on top of 4.5-rc1 and have been tested on a couple of x86
>>> machines.
>>
>> Well, no responses here, so I'm inclined to believe that this series is
>> fine
>> by everybody (at least by everybody in the CC).
>>
>> I can wait for a few days more, but new material is starting to pile up on
>> top
>> of these patches and I'll simply need to move forward at one point.
>
> Based on the test results for intel_pstate and acpi_cpufreq, I don't see any
> problem in applying these patches.

OK, I'm taking this as an ACK for the intel_pstate changes. :-)

Thanks,
Rafael

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH 0/3] cpufreq: Replace timers with utilization update callbacks
  2016-02-04 10:51   ` Juri Lelli
@ 2016-02-04 17:19     ` Rafael J. Wysocki
  0 siblings, 0 replies; 134+ messages in thread
From: Rafael J. Wysocki @ 2016-02-04 17:19 UTC (permalink / raw)
  To: Juri Lelli
  Cc: Rafael J. Wysocki, Linux PM list, Linux Kernel Mailing List,
	Peter Zijlstra, Srinivas Pandruvada, Viresh Kumar, Steve Muckle,
	Thomas Gleixner

On Thu, Feb 4, 2016 at 11:51 AM, Juri Lelli <juri.lelli@arm.com> wrote:
> Hi Rafael,
>
> On 03/02/16 23:20, Rafael J. Wysocki wrote:
>> On Friday, January 29, 2016 11:52:15 PM Rafael J. Wysocki wrote:
>> > Hi,
>> >
>> > The following patch series introduces a mechanism allowing the cpufreq core
>> > and "setpolicy" drivers to provide utilization update callbacks to be invoked
>> > by the scheduler on utilization changes.  Those callbacks can be used to run
>> > the sampling and frequency adjustments code (intel_pstate) or to schedule the
>> > execution of that code in process context (cpufreq core) instead of per-CPU
>> > deferrable timers used in cpufreq today (which Thomas complained about during
>> > the last Kernel Summit).
>> >
>> > [1/3] Introduce a mechanism for calling into cpufreq from the scheduler and
>> >       registering callbacks to be executed from there.
>> >
>> > [2/3] Modify intel_pstate to use the mechanism introduced by [1/3] instead
>> >       of per-CPU deferrable timers to do its work.
>> >
>> > This isn't entirely straightforward as the scheduler context running those
>> > callbacks is really special.  Among other things it can only use raw
>> > spinlocks and cannot invoke wake_up_process() directly.  Also, calling
>> > ktime_get() from there may be too expensive on some systems.  All that has to
>> > be taken into account, but even then the change allows some lines of code to be
>> > cut from the driver.
>> >
>> > Some performance and energy consumption measurements have been carried out with
>> > an earlier version of this patch and it looks like the changes lead to a
>> > slightly better performing system that consumes slightly less energy at the
>> > same time overall.
>> >
>> > [3/3] Modify the cpufreq core to use the mechanism introduced by [1/3] instead
>> >       of per-CPU deferrable timers to queue up the execution of governor work.
>> >
>> > Again, this isn't really straightforward for the above reasons, but still the
>> > code size is reduced a bit by the changes.
>> >
>> > I'm still unsure about the energy consumption and performance impact of [3/3]
>> > as earlier versions of it led to inconsistent results (most likely due to bugs
>> > in them that hopefully have been fixed in this version).  In particular, the
>> > additional irq_work may turn out to be problematic, but more optimizations are
>> > possible on top of this one even if it makes things worse by itself.
>> >
>> > For example, it should be possible to move the execution of state selection
>> > code into the utilization update callback itself, at least in principle, for
>> > all governors.  The P-state/OPP adjustment may need to be run from process
>> > context still, but for the drivers that can do it without sleeping it should
>> > be possible to move that into the utilization update callback as well.
>> >
>> > The patches are on top of 4.5-rc1 and have been tested on a couple of x86
>> > machines.
>>
>> Well, no responses here, so I'm inclined to believe that this series is fine
>> by everybody (at least by everybody in the CC).
>>
>
> I did intend to test and review this series, but then other patches
> required attention as well and I didn't find time to have a look at
> these. Sorry about that. Also, if I can speak for him, I think that
> Steve is OOO this week.

No problem at all.

>> I can wait for a few days more, but new material is starting to pile up on top
>> of these patches and I'll simply need to move forward at one point.
>>
>
> Unfortunately, I can't promise anything at the moment, but, if I find
> some time, I'll run some tests (BTW, do you already have something that I
> can run on my boxes?). I guess I can eventually do that after
> this gets merged as well.

Thanks!

Well, everything that might regress performance-wise or from the
energy consumption standpoint would be good to run.

Thanks,
Rafael

^ permalink raw reply	[flat|nested] 134+ messages in thread

* [PATCH 3/3 v3] cpufreq: governor: Replace timers with utilization update callbacks
  2016-02-03  1:16   ` [Update][PATCH " Rafael J. Wysocki
  2016-02-04  4:49     ` Viresh Kumar
@ 2016-02-05  1:28     ` Rafael J. Wysocki
  2016-02-05  6:50       ` Viresh Kumar
  2016-02-06  3:40       ` [PATCH 3/3 v4] " Rafael J. Wysocki
  1 sibling, 2 replies; 134+ messages in thread
From: Rafael J. Wysocki @ 2016-02-05  1:28 UTC (permalink / raw)
  To: Linux PM list
  Cc: Linux Kernel Mailing List, Peter Zijlstra, Srinivas Pandruvada,
	Viresh Kumar, Juri Lelli, Steve Muckle, Thomas Gleixner

From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

Instead of using a per-CPU deferrable timer for queuing up governor
work items, register a utilization update callback that will be
invoked from the scheduler on utilization changes.

The sampling rate is still the same as what was used for the
deferrable timers and the added irq_work overhead should be offset by
the eliminated timers overhead, so in theory the functional impact of
this patch should not be significant.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
---

The v3 addresses some review comments from Viresh and a couple of issues found
by me.  Changes from the previous version:
- Synchronize gov_cancel_work() with the (new) irq_work properly.
- Add a comment about the (new) memory barrier.
- Move sample_delay_ns to "shared" (struct cpu_common_dbs_info) so it is the
  same for all policy CPUs (without this modification we may end up taking
  samples too often).
- Drop some more unused code (in dbs_work_handler()).

Thanks,
Rafael

---
 drivers/cpufreq/cpufreq_conservative.c |    6 -
 drivers/cpufreq/cpufreq_governor.c     |  157 ++++++++++++++-------------------
 drivers/cpufreq/cpufreq_governor.h     |   15 ++-
 drivers/cpufreq/cpufreq_ondemand.c     |   25 ++---
 4 files changed, 95 insertions(+), 108 deletions(-)

Index: linux-pm/drivers/cpufreq/cpufreq_governor.h
===================================================================
--- linux-pm.orig/drivers/cpufreq/cpufreq_governor.h
+++ linux-pm/drivers/cpufreq/cpufreq_governor.h
@@ -18,6 +18,8 @@
 #define _CPUFREQ_GOVERNOR_H
 
 #include <linux/atomic.h>
+#include <linux/irq_work.h>
+#include <linux/completion.h>
 #include <linux/cpufreq.h>
 #include <linux/kernel_stat.h>
 #include <linux/module.h>
@@ -139,7 +141,11 @@ struct cpu_common_dbs_info {
 	struct mutex timer_mutex;
 
 	ktime_t time_stamp;
+	u64 last_sample_time;
+	s64 sample_delay_ns;
 	atomic_t skip_work;
+	struct irq_work irq_work;
+	struct completion irq_work_done;
 	struct work_struct work;
 };
 
@@ -155,7 +161,7 @@ struct cpu_dbs_info {
 	 * wake-up from idle.
 	 */
 	unsigned int prev_load;
-	struct timer_list timer;
+	struct update_util_data update_util;
 	struct cpu_common_dbs_info *shared;
 };
 
@@ -212,8 +218,7 @@ struct common_dbs_data {
 
 	struct cpu_dbs_info *(*get_cpu_cdbs)(int cpu);
 	void *(*get_cpu_dbs_info_s)(int cpu);
-	unsigned int (*gov_dbs_timer)(struct cpufreq_policy *policy,
-				      bool modify_all);
+	unsigned int (*gov_dbs_timer)(struct cpufreq_policy *policy);
 	void (*gov_check_cpu)(int cpu, unsigned int load);
 	int (*init)(struct dbs_data *dbs_data, bool notify);
 	void (*exit)(struct dbs_data *dbs_data, bool notify);
@@ -270,8 +275,8 @@ static ssize_t show_sampling_rate_min_go
 }
 
 extern struct mutex cpufreq_governor_lock;
-
-void gov_add_timers(struct cpufreq_policy *policy, unsigned int delay);
+void gov_set_update_util(struct cpu_common_dbs_info *shared,
+			 unsigned int delay_us);
 void gov_cancel_work(struct cpu_common_dbs_info *shared);
 void dbs_check_cpu(struct dbs_data *dbs_data, int cpu);
 int cpufreq_governor_dbs(struct cpufreq_policy *policy,
Index: linux-pm/drivers/cpufreq/cpufreq_governor.c
===================================================================
--- linux-pm.orig/drivers/cpufreq/cpufreq_governor.c
+++ linux-pm/drivers/cpufreq/cpufreq_governor.c
@@ -128,10 +128,10 @@ void dbs_check_cpu(struct dbs_data *dbs_
 		 * dropped down. So we perform the copy only once, upon the
 		 * first wake-up from idle.)
 		 *
-		 * Detecting this situation is easy: the governor's deferrable
-		 * timer would not have fired during CPU-idle periods. Hence
-		 * an unusually large 'wall_time' (as compared to the sampling
-		 * rate) indicates this scenario.
+		 * Detecting this situation is easy: the governor's utilization
+		 * update handler would not have run during CPU-idle periods.
+		 * Hence, an unusually large 'wall_time' (as compared to the
+		 * sampling rate) indicates this scenario.
 		 *
 		 * prev_load can be zero in two cases and we must recalculate it
 		 * for both cases:
@@ -161,129 +161,116 @@ void dbs_check_cpu(struct dbs_data *dbs_
 }
 EXPORT_SYMBOL_GPL(dbs_check_cpu);
 
-void gov_add_timers(struct cpufreq_policy *policy, unsigned int delay)
+void gov_set_update_util(struct cpu_common_dbs_info *shared,
+			 unsigned int delay_us)
 {
+	struct cpufreq_policy *policy = shared->policy;
 	struct dbs_data *dbs_data = policy->governor_data;
-	struct cpu_dbs_info *cdbs;
 	int cpu;
 
+	shared->sample_delay_ns = delay_us * NSEC_PER_USEC;
+	shared->time_stamp = ktime_get();
+	shared->last_sample_time = 0;
+
 	for_each_cpu(cpu, policy->cpus) {
-		cdbs = dbs_data->cdata->get_cpu_cdbs(cpu);
-		cdbs->timer.expires = jiffies + delay;
-		add_timer_on(&cdbs->timer, cpu);
+		struct cpu_dbs_info *cdbs = dbs_data->cdata->get_cpu_cdbs(cpu);
+
+		cpufreq_set_update_util_data(cpu, &cdbs->update_util);
 	}
 }
-EXPORT_SYMBOL_GPL(gov_add_timers);
+EXPORT_SYMBOL_GPL(gov_set_update_util);
 
-static inline void gov_cancel_timers(struct cpufreq_policy *policy)
+static inline void gov_clear_update_util(struct cpufreq_policy *policy)
 {
-	struct dbs_data *dbs_data = policy->governor_data;
-	struct cpu_dbs_info *cdbs;
 	int i;
 
-	for_each_cpu(i, policy->cpus) {
-		cdbs = dbs_data->cdata->get_cpu_cdbs(i);
-		del_timer_sync(&cdbs->timer);
-	}
+	for_each_cpu(i, policy->cpus)
+		cpufreq_set_update_util_data(i, NULL);
+
+	synchronize_rcu();
 }
 
 void gov_cancel_work(struct cpu_common_dbs_info *shared)
 {
-	/* Tell dbs_timer_handler() to skip queuing up work items. */
+	/* Tell dbs_update_util_handler() to skip queuing up work items. */
 	atomic_inc(&shared->skip_work);
 	/*
-	 * If dbs_timer_handler() is already running, it may not notice the
-	 * incremented skip_work, so wait for it to complete to prevent its work
-	 * item from being queued up after the cancel_work_sync() below.
-	 */
-	gov_cancel_timers(shared->policy);
-	/*
-	 * In case dbs_timer_handler() managed to run and spawn a work item
-	 * before the timers have been canceled, wait for that work item to
-	 * complete and then cancel all of the timers set up by it.  If
-	 * dbs_timer_handler() runs again at that point, it will see the
-	 * positive value of skip_work and won't spawn any more work items.
+	 * If dbs_update_util_handler() is already running, it may not notice
+	 * the incremented skip_work, so wait for it to complete to prevent its
+	 * work item from being queued up after the cancel_work_sync() below.
 	 */
+	gov_clear_update_util(shared->policy);
+	wait_for_completion(&shared->irq_work_done);
 	cancel_work_sync(&shared->work);
-	gov_cancel_timers(shared->policy);
 	atomic_set(&shared->skip_work, 0);
 }
 EXPORT_SYMBOL_GPL(gov_cancel_work);
 
-/* Will return if we need to evaluate cpu load again or not */
-static bool need_load_eval(struct cpu_common_dbs_info *shared,
-			   unsigned int sampling_rate)
-{
-	if (policy_is_shared(shared->policy)) {
-		ktime_t time_now = ktime_get();
-		s64 delta_us = ktime_us_delta(time_now, shared->time_stamp);
-
-		/* Do nothing if we recently have sampled */
-		if (delta_us < (s64)(sampling_rate / 2))
-			return false;
-		else
-			shared->time_stamp = time_now;
-	}
-
-	return true;
-}
-
 static void dbs_work_handler(struct work_struct *work)
 {
 	struct cpu_common_dbs_info *shared = container_of(work, struct
 					cpu_common_dbs_info, work);
 	struct cpufreq_policy *policy;
 	struct dbs_data *dbs_data;
-	unsigned int sampling_rate, delay;
-	bool eval_load;
+	unsigned int delay;
 
 	policy = shared->policy;
 	dbs_data = policy->governor_data;
 
-	/* Kill all timers */
-	gov_cancel_timers(policy);
-
-	if (dbs_data->cdata->governor == GOV_CONSERVATIVE) {
-		struct cs_dbs_tuners *cs_tuners = dbs_data->tuners;
-
-		sampling_rate = cs_tuners->sampling_rate;
-	} else {
-		struct od_dbs_tuners *od_tuners = dbs_data->tuners;
-
-		sampling_rate = od_tuners->sampling_rate;
-	}
-
-	eval_load = need_load_eval(shared, sampling_rate);
-
 	/*
-	 * Make sure cpufreq_governor_limits() isn't evaluating load in
+	 * Make sure cpufreq_governor_limits() isn't evaluating load or the
+	 * ondemand governor isn't reading the time stamp and sampling rate in
 	 * parallel.
 	 */
 	mutex_lock(&shared->timer_mutex);
-	delay = dbs_data->cdata->gov_dbs_timer(policy, eval_load);
+	delay = dbs_data->cdata->gov_dbs_timer(policy);
+	shared->sample_delay_ns = jiffies_to_nsecs(delay);
+	shared->time_stamp = ktime_get();
 	mutex_unlock(&shared->timer_mutex);
 
+	/*
+	 * If the atomic operation below is reordered with respect to the
+	 * sample delay modification, the utilization update handler may end
+	 * up using a stale sample delay value.
+	 */
+	smp_mb__before_atomic();
 	atomic_dec(&shared->skip_work);
+}
 
-	gov_add_timers(policy, delay);
+static void dbs_irq_work(struct irq_work *irq_work)
+{
+	struct cpu_common_dbs_info *shared;
+
+	shared = container_of(irq_work, struct cpu_common_dbs_info, irq_work);
+	schedule_work(&shared->work);
+	complete(&shared->irq_work_done);
 }
 
-static void dbs_timer_handler(unsigned long data)
+static void dbs_update_util_handler(struct update_util_data *data, u64 time,
+				    unsigned long util, unsigned long max)
 {
-	struct cpu_dbs_info *cdbs = (struct cpu_dbs_info *)data;
+	struct cpu_dbs_info *cdbs = container_of(data, struct cpu_dbs_info, update_util);
 	struct cpu_common_dbs_info *shared = cdbs->shared;
 
 	/*
-	 * Timer handler may not be allowed to queue the work at the moment,
-	 * because:
-	 * - Another timer handler has done that
-	 * - We are stopping the governor
-	 * - Or we are updating the sampling rate of the ondemand governor
+	 * The work may not be allowed to be queued up right now.
+	 * Possible reasons:
+	 * - Work has already been queued up or is in progress.
+	 * - The governor is being stopped.
+	 * - It is too early (too little time from the previous sample).
 	 */
-	if (atomic_inc_return(&shared->skip_work) > 1)
-		atomic_dec(&shared->skip_work);
-	else
-		queue_work(system_wq, &shared->work);
+	if (atomic_inc_return(&shared->skip_work) == 1) {
+		u64 delta_ns;
+
+		delta_ns = time - shared->last_sample_time;
+		if ((s64)delta_ns >= shared->sample_delay_ns) {
+			shared->last_sample_time = time;
+			reinit_completion(&shared->irq_work_done);
+			irq_work_queue_on(&shared->irq_work, smp_processor_id());
+			return;
+		}
+	}
+	atomic_dec(&shared->skip_work);
 }
 
 static void set_sampling_rate(struct dbs_data *dbs_data,
@@ -467,9 +454,6 @@ static int cpufreq_governor_start(struct
 		io_busy = od_tuners->io_is_busy;
 	}
 
-	shared->policy = policy;
-	shared->time_stamp = ktime_get();
-
 	for_each_cpu(j, policy->cpus) {
 		struct cpu_dbs_info *j_cdbs = cdata->get_cpu_cdbs(j);
 		unsigned int prev_load;
@@ -485,10 +469,11 @@ static int cpufreq_governor_start(struct
 		if (ignore_nice)
 			j_cdbs->prev_cpu_nice = kcpustat_cpu(j).cpustat[CPUTIME_NICE];
 
-		__setup_timer(&j_cdbs->timer, dbs_timer_handler,
-			      (unsigned long)j_cdbs,
-			      TIMER_DEFERRABLE | TIMER_IRQSAFE);
+		j_cdbs->update_util.func = dbs_update_util_handler;
 	}
+	shared->policy = policy;
+	init_irq_work(&shared->irq_work, dbs_irq_work);
+	init_completion(&shared->irq_work_done);
 
 	if (cdata->governor == GOV_CONSERVATIVE) {
 		struct cs_cpu_dbs_info_s *cs_dbs_info =
@@ -505,7 +490,7 @@ static int cpufreq_governor_start(struct
 		od_ops->powersave_bias_init_cpu(cpu);
 	}
 
-	gov_add_timers(policy, delay_for_sampling_rate(sampling_rate));
+	gov_set_update_util(shared, sampling_rate);
 	return 0;
 }
 
Index: linux-pm/drivers/cpufreq/cpufreq_ondemand.c
===================================================================
--- linux-pm.orig/drivers/cpufreq/cpufreq_ondemand.c
+++ linux-pm/drivers/cpufreq/cpufreq_ondemand.c
@@ -191,7 +191,7 @@ static void od_check_cpu(int cpu, unsign
 	}
 }
 
-static unsigned int od_dbs_timer(struct cpufreq_policy *policy, bool modify_all)
+static unsigned int od_dbs_timer(struct cpufreq_policy *policy)
 {
 	struct dbs_data *dbs_data = policy->governor_data;
 	unsigned int cpu = policy->cpu;
@@ -200,9 +200,6 @@ static unsigned int od_dbs_timer(struct
 	struct od_dbs_tuners *od_tuners = dbs_data->tuners;
 	int delay = 0, sample_type = dbs_info->sample_type;
 
-	if (!modify_all)
-		goto max_delay;
-
 	/* Common NORMAL_SAMPLE setup */
 	dbs_info->sample_type = OD_NORMAL_SAMPLE;
 	if (sample_type == OD_SUB_SAMPLE) {
@@ -218,7 +215,6 @@ static unsigned int od_dbs_timer(struct
 		}
 	}
 
-max_delay:
 	if (!delay)
 		delay = delay_for_sampling_rate(od_tuners->sampling_rate
 				* dbs_info->rate_mult);
@@ -264,7 +260,7 @@ static void update_sampling_rate(struct
 		struct od_cpu_dbs_info_s *dbs_info;
 		struct cpu_dbs_info *cdbs;
 		struct cpu_common_dbs_info *shared;
-		unsigned long next_sampling, appointed_at;
+		ktime_t next_sampling, appointed_at;
 
 		dbs_info = &per_cpu(od_cpu_dbs_info, cpu);
 		cdbs = &dbs_info->cdbs;
@@ -292,16 +288,19 @@ static void update_sampling_rate(struct
 			continue;
 
 		/*
-		 * Checking this for any CPU should be fine, timers for all of
-		 * them are scheduled together.
+		 * Checking this for any CPU sharing the policy should be fine,
+		 * they are all scheduled to sample at the same time.
 		 */
-		next_sampling = jiffies + usecs_to_jiffies(new_rate);
-		appointed_at = dbs_info->cdbs.timer.expires;
+		next_sampling = ktime_add_us(ktime_get(), new_rate);
 
-		if (time_before(next_sampling, appointed_at)) {
-			gov_cancel_work(shared);
-			gov_add_timers(policy, usecs_to_jiffies(new_rate));
+		mutex_lock(&shared->timer_mutex);
+		appointed_at = ktime_add_ns(shared->time_stamp,
+					    shared->sample_delay_ns);
+		mutex_unlock(&shared->timer_mutex);
 
+		if (ktime_before(next_sampling, appointed_at)) {
+			gov_cancel_work(shared);
+			gov_set_update_util(shared, new_rate);
 		}
 	}
 
Index: linux-pm/drivers/cpufreq/cpufreq_conservative.c
===================================================================
--- linux-pm.orig/drivers/cpufreq/cpufreq_conservative.c
+++ linux-pm/drivers/cpufreq/cpufreq_conservative.c
@@ -115,14 +115,12 @@ static void cs_check_cpu(int cpu, unsign
 	}
 }
 
-static unsigned int cs_dbs_timer(struct cpufreq_policy *policy, bool modify_all)
+static unsigned int cs_dbs_timer(struct cpufreq_policy *policy)
 {
 	struct dbs_data *dbs_data = policy->governor_data;
 	struct cs_dbs_tuners *cs_tuners = dbs_data->tuners;
 
-	if (modify_all)
-		dbs_check_cpu(dbs_data, policy->cpu);
-
+	dbs_check_cpu(dbs_data, policy->cpu);
 	return delay_for_sampling_rate(cs_tuners->sampling_rate);
 }
 

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH 3/3 v3] cpufreq: governor: Replace timers with utilization update callbacks
  2016-02-05  1:28     ` [PATCH 3/3 v3] " Rafael J. Wysocki
@ 2016-02-05  6:50       ` Viresh Kumar
  2016-02-05 13:36         ` Rafael J. Wysocki
  2016-02-06  3:40       ` [PATCH 3/3 v4] " Rafael J. Wysocki
  1 sibling, 1 reply; 134+ messages in thread
From: Viresh Kumar @ 2016-02-05  6:50 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Linux PM list, Linux Kernel Mailing List, Peter Zijlstra,
	Srinivas Pandruvada, Juri Lelli, Steve Muckle, Thomas Gleixner

Will suck some more blood, sorry about that :)

On 05-02-16, 02:28, Rafael J. Wysocki wrote:
> The v3 addresses some review comments from Viresh and a couple of issues found
> by me.  Changes from the previous version:
> - Synchronize gov_cancel_work() with the (new) irq_work properly.
> - Add a comment about the (new) memory barrier.
> - Move sample_delay_ns to "shared" (struct cpu_common_dbs_info) so it is the

sample_delay_ns was already there, you moved last_sample_time instead :)

> @@ -139,7 +141,11 @@ struct cpu_common_dbs_info {
>  	struct mutex timer_mutex;
>  
>  	ktime_t time_stamp;
> +	u64 last_sample_time;
> +	s64 sample_delay_ns;
>  	atomic_t skip_work;
> +	struct irq_work irq_work;

Just for my understanding, why can't we schedule a normal work directly? Is it
because this is in the scheduler's hot path and queue_work() is too slow?

> Index: linux-pm/drivers/cpufreq/cpufreq_governor.c
> +void gov_set_update_util(struct cpu_common_dbs_info *shared,
> +			 unsigned int delay_us)
>  {
> +	struct cpufreq_policy *policy = shared->policy;
>  	struct dbs_data *dbs_data = policy->governor_data;
> -	struct cpu_dbs_info *cdbs;
>  	int cpu;
>  
> +	shared->sample_delay_ns = delay_us * NSEC_PER_USEC;
> +	shared->time_stamp = ktime_get();
> +	shared->last_sample_time = 0;

Calling this routine from update_sampling_rate() is still wrong, because that
will also make last_sample_time = 0, which means that we will schedule the
irq-work on the next util update.

We surely didn't want that to happen, did we ?

>  	for_each_cpu(cpu, policy->cpus) {
> -		cdbs = dbs_data->cdata->get_cpu_cdbs(cpu);
> -		cdbs->timer.expires = jiffies + delay;
> -		add_timer_on(&cdbs->timer, cpu);
> +		struct cpu_dbs_info *cdbs = dbs_data->cdata->get_cpu_cdbs(cpu);
> +
> +		cpufreq_set_update_util_data(cpu, &cdbs->update_util);
>  	}
>  }
> -EXPORT_SYMBOL_GPL(gov_add_timers);
> +EXPORT_SYMBOL_GPL(gov_set_update_util);

>  void gov_cancel_work(struct cpu_common_dbs_info *shared)
>  {
> -	/* Tell dbs_timer_handler() to skip queuing up work items. */
> +	/* Tell dbs_update_util_handler() to skip queuing up work items. */
>  	atomic_inc(&shared->skip_work);
>  	/*
> -	 * If dbs_timer_handler() is already running, it may not notice the
> -	 * incremented skip_work, so wait for it to complete to prevent its work
> -	 * item from being queued up after the cancel_work_sync() below.
> -	 */
> -	gov_cancel_timers(shared->policy);
> -	/*
> -	 * In case dbs_timer_handler() managed to run and spawn a work item
> -	 * before the timers have been canceled, wait for that work item to
> -	 * complete and then cancel all of the timers set up by it.  If
> -	 * dbs_timer_handler() runs again at that point, it will see the
> -	 * positive value of skip_work and won't spawn any more work items.
> +	 * If dbs_update_util_handler() is already running, it may not notice
> +	 * the incremented skip_work, so wait for it to complete to prevent its
> +	 * work item from being queued up after the cancel_work_sync() below.
>  	 */
> +	gov_clear_update_util(shared->policy);
> +	wait_for_completion(&shared->irq_work_done);

I may be wrong, but isn't running irq_work_sync() enough here instead ?
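
Something like this, I mean (just a sketch):

	gov_clear_update_util(shared->policy);
	irq_work_sync(&shared->irq_work);	/* waits for the irq_work to complete */
	cancel_work_sync(&shared->work);

That would make the completion (and the reinit_completion() in the hot
path) unnecessary.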

>  	cancel_work_sync(&shared->work);
> -	gov_cancel_timers(shared->policy);
>  	atomic_set(&shared->skip_work, 0);
>  }
>  EXPORT_SYMBOL_GPL(gov_cancel_work);

> Index: linux-pm/drivers/cpufreq/cpufreq_ondemand.c
> @@ -264,7 +260,7 @@ static void update_sampling_rate(struct
>  		struct od_cpu_dbs_info_s *dbs_info;
>  		struct cpu_dbs_info *cdbs;
>  		struct cpu_common_dbs_info *shared;
> -		unsigned long next_sampling, appointed_at;
> +		ktime_t next_sampling, appointed_at;
>  
>  		dbs_info = &per_cpu(od_cpu_dbs_info, cpu);
>  		cdbs = &dbs_info->cdbs;
> @@ -292,16 +288,19 @@ static void update_sampling_rate(struct
>  			continue;
>  
>  		/*
> -		 * Checking this for any CPU should be fine, timers for all of
> -		 * them are scheduled together.
> +		 * Checking this for any CPU sharing the policy should be fine,
> +		 * they are all scheduled to sample at the same time.
>  		 */
> -		next_sampling = jiffies + usecs_to_jiffies(new_rate);
> -		appointed_at = dbs_info->cdbs.timer.expires;
> +		next_sampling = ktime_add_us(ktime_get(), new_rate);
>  
> -		if (time_before(next_sampling, appointed_at)) {
> -			gov_cancel_work(shared);
> -			gov_add_timers(policy, usecs_to_jiffies(new_rate));
> +		mutex_lock(&shared->timer_mutex);
> +		appointed_at = ktime_add_ns(shared->time_stamp,
> +					    shared->sample_delay_ns);
> +		mutex_unlock(&shared->timer_mutex);
>  
> +		if (ktime_before(next_sampling, appointed_at)) {
> +			gov_cancel_work(shared);
> +			gov_set_update_util(shared, new_rate);

So, I don't think we need to call these heavy routines at all here. Just use the
above timer_mutex to update time_stamp and sample_delay_ns.

On top of that, that particular change might turn out to be a big, big bonus for
us. Why would we be taking the od_dbs_cdata.mutex in this routine anymore ? We
aren't removing/adding timers anymore, just updating sample_delay_ns, and there
shouldn't be any races. Of course, you need to use the same timer_mutex in the
util handler as well around sample_delay_ns, I believe.
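
IOW, something along these lines should be sufficient (just a sketch,
reusing the fields this patch introduces):

	mutex_lock(&shared->timer_mutex);
	shared->sample_delay_ns = new_rate * NSEC_PER_USEC;
	shared->time_stamp = ktime_get();
	mutex_unlock(&shared->timer_mutex);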

And that will also kill the circular lockdep dependency we have been
chasing badly :)

Or am I being overexcited here ? :(

-- 
viresh

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH 3/3 v3] cpufreq: governor: Replace timers with utilization update callbacks
  2016-02-05  6:50       ` Viresh Kumar
@ 2016-02-05 13:36         ` Rafael J. Wysocki
  2016-02-05 14:47           ` Viresh Kumar
  2016-02-05 23:01           ` Rafael J. Wysocki
  0 siblings, 2 replies; 134+ messages in thread
From: Rafael J. Wysocki @ 2016-02-05 13:36 UTC (permalink / raw)
  To: Viresh Kumar
  Cc: Rafael J. Wysocki, Linux PM list, Linux Kernel Mailing List,
	Peter Zijlstra, Srinivas Pandruvada, Juri Lelli, Steve Muckle,
	Thomas Gleixner

On Fri, Feb 5, 2016 at 7:50 AM, Viresh Kumar <viresh.kumar@linaro.org> wrote:
> Will suck some more blood, sorry about that :)
>
> On 05-02-16, 02:28, Rafael J. Wysocki wrote:
>> The v3 addresses some review comments from Viresh and a couple of issues found
>> by me.  Changes from the previous version:
>> - Synchronize gov_cancel_work() with the (new) irq_work properly.
>> - Add a comment about the (new) memory barrier.
>> - Move sample_delay_ns to "shared" (struct cpu_common_dbs_info) so it is the
>
> sample_delay_ns was already there, you moved last_sample_time instead :)
>
>> @@ -139,7 +141,11 @@ struct cpu_common_dbs_info {
>>       struct mutex timer_mutex;
>>
>>       ktime_t time_stamp;
>> +     u64 last_sample_time;
>> +     s64 sample_delay_ns;
>>       atomic_t skip_work;
>> +     struct irq_work irq_work;
>
> Just for my understanding, why can't we schedule a normal work item directly?
> Is it because of the scheduler's hot path and queue_work() being slow?

No, that's not the reason.

That path can't call wake_up_process() as it may be holding the very locks
that call would attempt to grab.

That said, it is hot too.  For example, ktime_get() may be too slow to
be called from it on some systems.
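
To illustrate, this is the trampoline shape used here (a minimal sketch with
made-up names, not the patch code itself): the scheduler-context callback only
queues an irq_work, and the irq_work handler, which runs in hardirq context
where queuing (and thereby waking) a worker is safe, schedules the real work
item:

#include <linux/irq_work.h>
#include <linux/workqueue.h>

static void sample_work_fn(struct work_struct *work)
{
	/* Process context: sleeping locks, wakeups, ktime_get() are all fine. */
}

static DECLARE_WORK(sample_work, sample_work_fn);
static struct irq_work sample_irq_work;

/* Hardirq context: queuing up (and thereby waking) a worker is safe. */
static void sample_irq_work_fn(struct irq_work *iw)
{
	schedule_work(&sample_work);
}

/* Scheduler context, possibly under rq locks: no wake_up_process(), no
 * sleeping locks, so only poke the (self-IPI based) irq_work. */
static void util_update_callback(void)
{
	irq_work_queue(&sample_irq_work);
}

static void sample_setup(void)
{
	init_irq_work(&sample_irq_work, sample_irq_work_fn);
}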

>> Index: linux-pm/drivers/cpufreq/cpufreq_governor.c
>> +void gov_set_update_util(struct cpu_common_dbs_info *shared,
>> +                      unsigned int delay_us)
>>  {
>> +     struct cpufreq_policy *policy = shared->policy;
>>       struct dbs_data *dbs_data = policy->governor_data;
>> -     struct cpu_dbs_info *cdbs;
>>       int cpu;
>>
>> +     shared->sample_delay_ns = delay_us * NSEC_PER_USEC;
>> +     shared->time_stamp = ktime_get();
>> +     shared->last_sample_time = 0;
>
> Calling this routine from update_sampling_rate() is still wrong, because that
> will also make last_sample_time = 0, which means that we will schedule the
> irq_work on the next util update.

That isn't a problem, though.

This is the case when the new rate is smaller than the old one and we
want it to take effect immediately.  Taking the next sample
immediately in that case is not going to hurt anyone.

And this observation actually leads to an interesting realization
about update_sampling_rate() (see below).

> That surely isn't something we wanted to happen, is it?

No, it isn't. :-)

>
>>       for_each_cpu(cpu, policy->cpus) {
>> -             cdbs = dbs_data->cdata->get_cpu_cdbs(cpu);
>> -             cdbs->timer.expires = jiffies + delay;
>> -             add_timer_on(&cdbs->timer, cpu);
>> +             struct cpu_dbs_info *cdbs = dbs_data->cdata->get_cpu_cdbs(cpu);
>> +
>> +             cpufreq_set_update_util_data(cpu, &cdbs->update_util);
>>       }
>>  }
>> -EXPORT_SYMBOL_GPL(gov_add_timers);
>> +EXPORT_SYMBOL_GPL(gov_set_update_util);
>
>>  void gov_cancel_work(struct cpu_common_dbs_info *shared)
>>  {
>> -     /* Tell dbs_timer_handler() to skip queuing up work items. */
>> +     /* Tell dbs_update_util_handler() to skip queuing up work items. */
>>       atomic_inc(&shared->skip_work);
>>       /*
>> -      * If dbs_timer_handler() is already running, it may not notice the
>> -      * incremented skip_work, so wait for it to complete to prevent its work
>> -      * item from being queued up after the cancel_work_sync() below.
>> -      */
>> -     gov_cancel_timers(shared->policy);
>> -     /*
>> -      * In case dbs_timer_handler() managed to run and spawn a work item
>> -      * before the timers have been canceled, wait for that work item to
>> -      * complete and then cancel all of the timers set up by it.  If
>> -      * dbs_timer_handler() runs again at that point, it will see the
>> -      * positive value of skip_work and won't spawn any more work items.
>> +      * If dbs_update_util_handler() is already running, it may not notice
>> +      * the incremented skip_work, so wait for it to complete to prevent its
>> +      * work item from being queued up after the cancel_work_sync() below.
>>        */
>> +     gov_clear_update_util(shared->policy);
>> +     wait_for_completion(&shared->irq_work_done);
>
> I may be wrong, but isn't running irq_work_sync() enough here instead?

Yes, it is.

For some reason I assumed that it would only check whether the irq_work was
running at that moment, but that's not the case: irq_work_sync() actually
waits for a pending or running irq_work to complete.

>>       cancel_work_sync(&shared->work);
>> -     gov_cancel_timers(shared->policy);
>>       atomic_set(&shared->skip_work, 0);
>>  }
>>  EXPORT_SYMBOL_GPL(gov_cancel_work);
>
>> Index: linux-pm/drivers/cpufreq/cpufreq_ondemand.c
>> @@ -264,7 +260,7 @@ static void update_sampling_rate(struct
>>               struct od_cpu_dbs_info_s *dbs_info;
>>               struct cpu_dbs_info *cdbs;
>>               struct cpu_common_dbs_info *shared;
>> -             unsigned long next_sampling, appointed_at;
>> +             ktime_t next_sampling, appointed_at;
>>
>>               dbs_info = &per_cpu(od_cpu_dbs_info, cpu);
>>               cdbs = &dbs_info->cdbs;
>> @@ -292,16 +288,19 @@ static void update_sampling_rate(struct
>>                       continue;
>>
>>               /*
>> -              * Checking this for any CPU should be fine, timers for all of
>> -              * them are scheduled together.
>> +              * Checking this for any CPU sharing the policy should be fine,
>> +              * they are all scheduled to sample at the same time.
>>                */
>> -             next_sampling = jiffies + usecs_to_jiffies(new_rate);
>> -             appointed_at = dbs_info->cdbs.timer.expires;
>> +             next_sampling = ktime_add_us(ktime_get(), new_rate);
>>
>> -             if (time_before(next_sampling, appointed_at)) {
>> -                     gov_cancel_work(shared);
>> -                     gov_add_timers(policy, usecs_to_jiffies(new_rate));
>> +             mutex_lock(&shared->timer_mutex);
>> +             appointed_at = ktime_add_ns(shared->time_stamp,
>> +                                         shared->sample_delay_ns);
>> +             mutex_unlock(&shared->timer_mutex);
>>
>> +             if (ktime_before(next_sampling, appointed_at)) {
>> +                     gov_cancel_work(shared);
>> +                     gov_set_update_util(shared, new_rate);
>
> So, I don't think we need to call these heavy routines at all here. Just use the
> above timer_mutex to update time_stamp and sample_delay_ns.

Well, the concern was that sample_delay_ns might not be updated
atomically on 32-bit architectures and that might be a problem for
dbs_update_util_handler().  However, this really isn't a problem,
because dbs_update_util_handler() only decides whether or not to take
a sample *this* time.  If it sees a semi-updated value of
sample_delay_ns, that value will be either too small or too big, so it
will either skip the sample unnecessarily or take it immediately and
none of these is a real problem.  It doesn't hurt to take the sample
immediately at this point (as stated earlier) and if it is skipped, it
will be taken on the next attempt when the update has been completed
(which would have happened anyway had the update been atomic).
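
In other words, the reader side (sketched below with the field names from the
patch; the full handler is in the patch further down) tolerates any transient
value:

delta_ns = time - shared->last_sample_time;
if ((s64)delta_ns >= shared->sample_delay_ns) {
	/* A torn value that was too small only gets us here one update
	 * early, which is harmless. */
	shared->last_sample_time = time;
	/* queue up the sample */
}
/* A torn value that was too big just skips this update; the next one
 * will see the completed store and take the sample then. */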

> On top of that, this particular change might turn out to be a big, big bonus
> for us. Why would we be taking the od_dbs_cdata.mutex in this routine anymore?
> We aren't removing/adding timers anymore, just updating sample_delay_ns, and
> there shouldn't be any races.

That's a very good point.

The only concern is that this function walks the entire collection of
cpu_dbs_infos and that's potentially racing with anything that updates
those.

> Of course you need to use the same timer_mutex in util's
> handler as well around sample_delay_ns, I believe.

That can't take any mutexes.  It might only take a raw spinlock if
really needed.

> And that will also kill the circular lockdep dependency we have been chasing
> so badly :)
>
> Or am I being overexcited here? :(

Not really.  I think you're on the right track.

Before we drop the lock from here, though, we need to audit the code
for any possible races carefully.

Anyway, I'll send an update of the $subject patch later today when I
have a chance to run it through some tests.

Thanks,
Rafael

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH 3/3 v3] cpufreq: governor: Replace timers with utilization update callbacks
  2016-02-05 13:36         ` Rafael J. Wysocki
@ 2016-02-05 14:47           ` Viresh Kumar
  2016-02-05 23:10             ` Rafael J. Wysocki
  2016-02-05 23:01           ` Rafael J. Wysocki
  1 sibling, 1 reply; 134+ messages in thread
From: Viresh Kumar @ 2016-02-05 14:47 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Rafael J. Wysocki, Linux PM list, Linux Kernel Mailing List,
	Peter Zijlstra, Srinivas Pandruvada, Juri Lelli, Steve Muckle,
	Thomas Gleixner

On Fri, Feb 5, 2016 at 7:06 PM, Rafael J. Wysocki <rafael@kernel.org> wrote:

>>> Index: linux-pm/drivers/cpufreq/cpufreq_ondemand.c
>>> @@ -264,7 +260,7 @@ static void update_sampling_rate(struct
>>>               struct od_cpu_dbs_info_s *dbs_info;
>>>               struct cpu_dbs_info *cdbs;
>>>               struct cpu_common_dbs_info *shared;
>>> -             unsigned long next_sampling, appointed_at;
>>> +             ktime_t next_sampling, appointed_at;
>>>
>>>               dbs_info = &per_cpu(od_cpu_dbs_info, cpu);
>>>               cdbs = &dbs_info->cdbs;
>>> @@ -292,16 +288,19 @@ static void update_sampling_rate(struct
>>>                       continue;
>>>
>>>               /*
>>> -              * Checking this for any CPU should be fine, timers for all of
>>> -              * them are scheduled together.
>>> +              * Checking this for any CPU sharing the policy should be fine,
>>> +              * they are all scheduled to sample at the same time.
>>>                */
>>> -             next_sampling = jiffies + usecs_to_jiffies(new_rate);
>>> -             appointed_at = dbs_info->cdbs.timer.expires;
>>> +             next_sampling = ktime_add_us(ktime_get(), new_rate);
>>>
>>> -             if (time_before(next_sampling, appointed_at)) {
>>> -                     gov_cancel_work(shared);
>>> -                     gov_add_timers(policy, usecs_to_jiffies(new_rate));
>>> +             mutex_lock(&shared->timer_mutex);
>>> +             appointed_at = ktime_add_ns(shared->time_stamp,
>>> +                                         shared->sample_delay_ns);
>>> +             mutex_unlock(&shared->timer_mutex);
>>>
>>> +             if (ktime_before(next_sampling, appointed_at)) {
>>> +                     gov_cancel_work(shared);
>>> +                     gov_set_update_util(shared, new_rate);
>>
>> So, I don't think we need to call these heavy routines at all here. Just use the
>> above timer_mutex to update time_stamp and sample_delay_ns.
>
> Well, the concern was that sample_delay_ns might not be updated
> atomically on 32-bit architectures and that might be a problem for
> dbs_update_util_handler().  However, this really isn't a problem,
> because dbs_update_util_handler() only decides whether or not to take
> a sample *this* time.  If it sees a semi-updated value of
> sample_delay_ns, that value will be either too small or too big, so it
> will either skip the sample unnecessarily or take it immediately and
> none of these is a real problem.  It doesn't hurt to take the sample
> immediately at this point (as stated earlier) and if it is skipped, it
> will be taken on the next attempt when the update has been completed
> (which would have happened anyway had the update been atomic).

Okay, how about this then.

We do some computations here and, based on them, conditionally want to
update sample_delay_ns. Because there is no penalty now, in terms of
removing/adding timers/wq, etc., why shouldn't we simply update
sample_delay_ns every time, without any checks? That would mean that a
change of the sampling rate takes effect immediately, and what can be better
than that?

Also, we should do the same from update-sampling-rate of conservative
governor as well.

Just kill all this complex, unwanted code and make life simple.
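
Concretely, I mean shrinking the per-policy part of the loop to just something
like this (a rough sketch, reusing the fields from your patch):

if (dbs_data == policy->governor_data) {
	mutex_lock(&shared->timer_mutex);
	shared->sample_delay_ns = new_rate * NSEC_PER_USEC;
	mutex_unlock(&shared->timer_mutex);
}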

> The only concern is that this function walks the entire collection of
> cpu_dbs_infos and that's potentially racing with anything that updates
> those.

Yeah, but fixing this race should be easier than the other crazy things we are
looking to do with kobjects :)

> That can't take any mutexes.  It might only take a raw spinlock if
> really needed.

That's doable as well :)

> Before we drop the lock from here, though, we need to audit the code
> for any possible races carefully.

I did a bit of that this morning, and there weren't any serious issues
as far as I could see :)

--
viresh

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH 3/3 v3] cpufreq: governor: Replace timers with utilization update callbacks
  2016-02-05 13:36         ` Rafael J. Wysocki
  2016-02-05 14:47           ` Viresh Kumar
@ 2016-02-05 23:01           ` Rafael J. Wysocki
  1 sibling, 0 replies; 134+ messages in thread
From: Rafael J. Wysocki @ 2016-02-05 23:01 UTC (permalink / raw)
  To: Viresh Kumar
  Cc: Rafael J. Wysocki, Linux PM list, Linux Kernel Mailing List,
	Peter Zijlstra, Srinivas Pandruvada, Juri Lelli, Steve Muckle,
	Thomas Gleixner

On Friday, February 05, 2016 02:36:54 PM Rafael J. Wysocki wrote:
> On Fri, Feb 5, 2016 at 7:50 AM, Viresh Kumar <viresh.kumar@linaro.org> wrote:
> > Will suck some more blood, sorry about that :)
> >
> > On 05-02-16, 02:28, Rafael J. Wysocki wrote:
> >> The v3 addresses some review comments from Viresh and a couple of issues found
> >> by me.  Changes from the previous version:
> >> - Synchronize gov_cancel_work() with the (new) irq_work properly.
> >> - Add a comment about the (new) memory barrier.
> >> - Move sample_delay_ns to "shared" (struct cpu_common_dbs_info) so it is the
> >
> > sample_delay_ns was already there, you moved last_sample_time instead :)
> >
> >> @@ -139,7 +141,11 @@ struct cpu_common_dbs_info {
> >>       struct mutex timer_mutex;
> >>
> >>       ktime_t time_stamp;
> >> +     u64 last_sample_time;
> >> +     s64 sample_delay_ns;
> >>       atomic_t skip_work;
> >> +     struct irq_work irq_work;
> >
> > Just for my understanding, why can't we schedule a normal work item directly?
> > Is it because of the scheduler's hot path and queue_work() being slow?
> 
> No, that's not the reason.
> 
> That path can't call wake_up_process() as it may be holding the very locks
> that call would attempt to grab.

My answer wasn't really to the point here.

Among other things, the scheduler path cannot use normal spinlocks.  It can
only use raw spinlocks and this means no work queuing from it.
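
To illustrate the constraint (hypothetical names, nothing from the series):

static DEFINE_RAW_SPINLOCK(util_data_lock);

/* Runs in scheduler context, possibly under rq locks. */
static void util_hook(void)
{
	/* A raw spinlock never sleeps, even on PREEMPT_RT, so it is OK here. */
	raw_spin_lock(&util_data_lock);
	/*
	 * Update shared data here.  No queue_work(), no wake_up_process(),
	 * no mutexes and no normal spinlocks (those turn into sleeping
	 * locks on PREEMPT_RT).
	 */
	raw_spin_unlock(&util_data_lock);
}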

Thanks,
Rafael

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH 3/3 v3] cpufreq: governor: Replace timers with utilization update callbacks
  2016-02-05 14:47           ` Viresh Kumar
@ 2016-02-05 23:10             ` Rafael J. Wysocki
  2016-02-07  9:10               ` Viresh Kumar
  0 siblings, 1 reply; 134+ messages in thread
From: Rafael J. Wysocki @ 2016-02-05 23:10 UTC (permalink / raw)
  To: Viresh Kumar
  Cc: Rafael J. Wysocki, Linux PM list, Linux Kernel Mailing List,
	Peter Zijlstra, Srinivas Pandruvada, Juri Lelli, Steve Muckle,
	Thomas Gleixner

On Friday, February 05, 2016 08:17:56 PM Viresh Kumar wrote:
> On Fri, Feb 5, 2016 at 7:06 PM, Rafael J. Wysocki <rafael@kernel.org> wrote:
> 
> >>> Index: linux-pm/drivers/cpufreq/cpufreq_ondemand.c
> >>> @@ -264,7 +260,7 @@ static void update_sampling_rate(struct
> >>>               struct od_cpu_dbs_info_s *dbs_info;
> >>>               struct cpu_dbs_info *cdbs;
> >>>               struct cpu_common_dbs_info *shared;
> >>> -             unsigned long next_sampling, appointed_at;
> >>> +             ktime_t next_sampling, appointed_at;
> >>>
> >>>               dbs_info = &per_cpu(od_cpu_dbs_info, cpu);
> >>>               cdbs = &dbs_info->cdbs;
> >>> @@ -292,16 +288,19 @@ static void update_sampling_rate(struct
> >>>                       continue;
> >>>
> >>>               /*
> >>> -              * Checking this for any CPU should be fine, timers for all of
> >>> -              * them are scheduled together.
> >>> +              * Checking this for any CPU sharing the policy should be fine,
> >>> +              * they are all scheduled to sample at the same time.
> >>>                */
> >>> -             next_sampling = jiffies + usecs_to_jiffies(new_rate);
> >>> -             appointed_at = dbs_info->cdbs.timer.expires;
> >>> +             next_sampling = ktime_add_us(ktime_get(), new_rate);
> >>>
> >>> -             if (time_before(next_sampling, appointed_at)) {
> >>> -                     gov_cancel_work(shared);
> >>> -                     gov_add_timers(policy, usecs_to_jiffies(new_rate));
> >>> +             mutex_lock(&shared->timer_mutex);
> >>> +             appointed_at = ktime_add_ns(shared->time_stamp,
> >>> +                                         shared->sample_delay_ns);
> >>> +             mutex_unlock(&shared->timer_mutex);
> >>>
> >>> +             if (ktime_before(next_sampling, appointed_at)) {
> >>> +                     gov_cancel_work(shared);
> >>> +                     gov_set_update_util(shared, new_rate);
> >>
> >> So, I don't think we need to call these heavy routines at all here. Just use the
> >> above timer_mutex to update time_stamp and sample_delay_ns.
> >
> > Well, the concern was that sample_delay_ns might not be updated
> > atomically on 32-bit architectures and that might be a problem for
> > dbs_update_util_handler().  However, this really isn't a problem,
> > because dbs_update_util_handler() only decides whether or not to take
> > a sample *this* time.  If it sees a semi-updated value of
> > sample_delay_ns, that value will be either too small or too big, so it
> > will either skip the sample unnecessarily or take it immediately and
> > none of these is a real problem.  It doesn't hurt to take the sample
> > immediately at this point (as stated earlier) and if it is skipped, it
> > will be taken on the next attempt when the update has been completed
> > (which would have happened anyway had the update been atomic).
> 
> Okay, how about this then.
> 
> We do some computations here and based on them, conditionally want to
> update sample_delay_ns. Because there is no penalty now, in terms of
> removing/adding timers/wq, etc, why shouldn't we simply update the
> sample_delay_ns everytime without any checks? That would mean that the
> change of sampling rate is effective immediately, what can be better than that?

Yes, we can do that.

There is a small concern about updating in parallel with dbs_work_handler()
in which case we may overwrite the (hopefully already correct) sample_delay_ns
value that it has just written, but then it will be corrected next time we
take a sample, so it shouldn't be a big deal.
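
The worst-case interleaving is just this (a sketch using the field names from
the patch):

/* dbs_work_handler(), under timer_mutex: */
shared->sample_delay_ns = jiffies_to_nsecs(delay);

/* update_sampling_rate(), under timer_mutex, right afterwards: */
shared->sample_delay_ns = new_rate * NSEC_PER_USEC;

/*
 * The next dbs_work_handler() run recomputes the delay from the new
 * sampling rate anyway, so the values converge after one sample.
 */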

OK, I'll update the patch to do that.

> Also, we should do the same from update-sampling-rate of conservative
> governor as well.

Let's just not change the whole world in one patch, OK?

> Just kill all this complex, unwanted code and make life simple.
> 
> > The only concern is that this function walks the entire collection of
> > cpu_dbs_infos and that's potentially racing with anything that updates
> > those.
> 
> Yeah, but fixing this race should be easier than the other crazy things we are
> looking to do with kobjects :)

Yes, I agree.

> > That can't take any mutexes.  It might only take a raw spinlock if
> > really needed.
> 
> That's doable as well :)
> 
> > Before we drop the lock from here, though, we need to audit the code
> > for any possible races carefully.
> 
> I did a bit of that this morning, and there weren't any serious issues
> as far as I could see :)

The case I'm mostly concerned about is when update_sampling_rate() looks
at a CPU with a policy completely unrelated to the dbs_data it was called
for.  In that case the "shared" object may, in theory, just go away from
under it at any time while it is looking at that object.

The existing code has this problem AFAICS, and the reason why we don't see
any breakage from it right now is that the granularity of cdata->mutex
is really coarse.

Thanks,
Rafael

^ permalink raw reply	[flat|nested] 134+ messages in thread

* [PATCH 3/3 v4] cpufreq: governor: Replace timers with utilization update callbacks
  2016-02-05  1:28     ` [PATCH 3/3 v3] " Rafael J. Wysocki
  2016-02-05  6:50       ` Viresh Kumar
@ 2016-02-06  3:40       ` Rafael J. Wysocki
  2016-02-07  9:20         ` Viresh Kumar
  2016-02-07 14:50         ` [PATCH 3/3 v5] " Rafael J. Wysocki
  1 sibling, 2 replies; 134+ messages in thread
From: Rafael J. Wysocki @ 2016-02-06  3:40 UTC (permalink / raw)
  To: Linux PM list
  Cc: Linux Kernel Mailing List, Peter Zijlstra, Srinivas Pandruvada,
	Viresh Kumar, Juri Lelli, Steve Muckle, Thomas Gleixner

From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

Instead of using a per-CPU deferrable timer for queuing up governor
work items, register a utilization update callback that will be
invoked from the scheduler on utilization changes.

The sampling rate is still the same as what was used for the
deferrable timers and the added irq_work overhead should be offset by
the eliminated timers overhead, so in theory the functional impact of
this patch should not be significant.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
---

Updated after the recent discussion with Viresh.

Changes from v3:
- The completion used for irq_work synchronization replaced with irq_work_sync()
  in gov_cancel_work().
- update_sampling_rate() now modifies shared->sample_delay_ns for all CPUs
  where it matters directly with a big fat comment explaining why this is
  actually OK.
- The above means the time_stamp field in struct cpu_common_dbs_info is not
  necessary any more, so it is dropped.
- A build error for !CONFIG_SMP is addressed (hopefully effectively).

This version was lightly tested on an x86 laptop.

Thanks!

---
 drivers/cpufreq/cpufreq_conservative.c |    6 -
 drivers/cpufreq/cpufreq_governor.c     |  164 +++++++++++++++------------------
 drivers/cpufreq/cpufreq_governor.h     |   19 ++-
 drivers/cpufreq/cpufreq_ondemand.c     |   43 ++++----
 4 files changed, 112 insertions(+), 120 deletions(-)

Index: linux-pm/drivers/cpufreq/cpufreq_governor.h
===================================================================
--- linux-pm.orig/drivers/cpufreq/cpufreq_governor.h
+++ linux-pm/drivers/cpufreq/cpufreq_governor.h
@@ -18,6 +18,7 @@
 #define _CPUFREQ_GOVERNOR_H
 
 #include <linux/atomic.h>
+#include <linux/irq_work.h>
 #include <linux/cpufreq.h>
 #include <linux/kernel_stat.h>
 #include <linux/module.h>
@@ -138,11 +139,19 @@ struct cpu_common_dbs_info {
 	 */
 	struct mutex timer_mutex;
 
-	ktime_t time_stamp;
+	u64 last_sample_time;
+	s64 sample_delay_ns;
 	atomic_t skip_work;
+	struct irq_work irq_work;
 	struct work_struct work;
 };
 
+static inline void gov_update_sample_delay(struct cpu_common_dbs_info *shared,
+					   unsigned int delay_us)
+{
+	shared->sample_delay_ns = delay_us * NSEC_PER_USEC;
+}
+
 /* Per cpu structures */
 struct cpu_dbs_info {
 	u64 prev_cpu_idle;
@@ -155,7 +164,7 @@ struct cpu_dbs_info {
 	 * wake-up from idle.
 	 */
 	unsigned int prev_load;
-	struct timer_list timer;
+	struct update_util_data update_util;
 	struct cpu_common_dbs_info *shared;
 };
 
@@ -212,8 +221,7 @@ struct common_dbs_data {
 
 	struct cpu_dbs_info *(*get_cpu_cdbs)(int cpu);
 	void *(*get_cpu_dbs_info_s)(int cpu);
-	unsigned int (*gov_dbs_timer)(struct cpufreq_policy *policy,
-				      bool modify_all);
+	unsigned int (*gov_dbs_timer)(struct cpufreq_policy *policy);
 	void (*gov_check_cpu)(int cpu, unsigned int load);
 	int (*init)(struct dbs_data *dbs_data, bool notify);
 	void (*exit)(struct dbs_data *dbs_data, bool notify);
@@ -270,9 +278,6 @@ static ssize_t show_sampling_rate_min_go
 }
 
 extern struct mutex cpufreq_governor_lock;
-
-void gov_add_timers(struct cpufreq_policy *policy, unsigned int delay);
-void gov_cancel_work(struct cpu_common_dbs_info *shared);
 void dbs_check_cpu(struct dbs_data *dbs_data, int cpu);
 int cpufreq_governor_dbs(struct cpufreq_policy *policy,
 		struct common_dbs_data *cdata, unsigned int event);
Index: linux-pm/drivers/cpufreq/cpufreq_governor.c
===================================================================
--- linux-pm.orig/drivers/cpufreq/cpufreq_governor.c
+++ linux-pm/drivers/cpufreq/cpufreq_governor.c
@@ -128,10 +128,10 @@ void dbs_check_cpu(struct dbs_data *dbs_
 		 * dropped down. So we perform the copy only once, upon the
 		 * first wake-up from idle.)
 		 *
-		 * Detecting this situation is easy: the governor's deferrable
-		 * timer would not have fired during CPU-idle periods. Hence
-		 * an unusually large 'wall_time' (as compared to the sampling
-		 * rate) indicates this scenario.
+		 * Detecting this situation is easy: the governor's utilization
+		 * update handler would not have run during CPU-idle periods.
+		 * Hence, an unusually large 'wall_time' (as compared to the
+		 * sampling rate) indicates this scenario.
 		 *
 		 * prev_load can be zero in two cases and we must recalculate it
 		 * for both cases:
@@ -161,72 +161,48 @@ void dbs_check_cpu(struct dbs_data *dbs_
 }
 EXPORT_SYMBOL_GPL(dbs_check_cpu);
 
-void gov_add_timers(struct cpufreq_policy *policy, unsigned int delay)
+void gov_set_update_util(struct cpu_common_dbs_info *shared,
+			 unsigned int delay_us)
 {
+	struct cpufreq_policy *policy = shared->policy;
 	struct dbs_data *dbs_data = policy->governor_data;
-	struct cpu_dbs_info *cdbs;
 	int cpu;
 
+	gov_update_sample_delay(shared, delay_us);
+	shared->last_sample_time = 0;
+
 	for_each_cpu(cpu, policy->cpus) {
-		cdbs = dbs_data->cdata->get_cpu_cdbs(cpu);
-		cdbs->timer.expires = jiffies + delay;
-		add_timer_on(&cdbs->timer, cpu);
+		struct cpu_dbs_info *cdbs = dbs_data->cdata->get_cpu_cdbs(cpu);
+
+		cpufreq_set_update_util_data(cpu, &cdbs->update_util);
 	}
 }
-EXPORT_SYMBOL_GPL(gov_add_timers);
+EXPORT_SYMBOL_GPL(gov_set_update_util);
 
-static inline void gov_cancel_timers(struct cpufreq_policy *policy)
+static inline void gov_clear_update_util(struct cpufreq_policy *policy)
 {
-	struct dbs_data *dbs_data = policy->governor_data;
-	struct cpu_dbs_info *cdbs;
 	int i;
 
-	for_each_cpu(i, policy->cpus) {
-		cdbs = dbs_data->cdata->get_cpu_cdbs(i);
-		del_timer_sync(&cdbs->timer);
-	}
+	for_each_cpu(i, policy->cpus)
+		cpufreq_set_update_util_data(i, NULL);
+
+	synchronize_rcu();
 }
 
-void gov_cancel_work(struct cpu_common_dbs_info *shared)
+static void gov_cancel_work(struct cpu_common_dbs_info *shared)
 {
-	/* Tell dbs_timer_handler() to skip queuing up work items. */
+	/* Tell dbs_update_util_handler() to skip queuing up work items. */
 	atomic_inc(&shared->skip_work);
 	/*
-	 * If dbs_timer_handler() is already running, it may not notice the
-	 * incremented skip_work, so wait for it to complete to prevent its work
-	 * item from being queued up after the cancel_work_sync() below.
-	 */
-	gov_cancel_timers(shared->policy);
-	/*
-	 * In case dbs_timer_handler() managed to run and spawn a work item
-	 * before the timers have been canceled, wait for that work item to
-	 * complete and then cancel all of the timers set up by it.  If
-	 * dbs_timer_handler() runs again at that point, it will see the
-	 * positive value of skip_work and won't spawn any more work items.
+	 * If dbs_update_util_handler() is already running, it may not notice
+	 * the incremented skip_work, so wait for it to complete to prevent its
+	 * work item from being queued up after the cancel_work_sync() below.
 	 */
+	gov_clear_update_util(shared->policy);
+	irq_work_sync(&shared->irq_work);
 	cancel_work_sync(&shared->work);
-	gov_cancel_timers(shared->policy);
 	atomic_set(&shared->skip_work, 0);
 }
-EXPORT_SYMBOL_GPL(gov_cancel_work);
-
-/* Will return if we need to evaluate cpu load again or not */
-static bool need_load_eval(struct cpu_common_dbs_info *shared,
-			   unsigned int sampling_rate)
-{
-	if (policy_is_shared(shared->policy)) {
-		ktime_t time_now = ktime_get();
-		s64 delta_us = ktime_us_delta(time_now, shared->time_stamp);
-
-		/* Do nothing if we recently have sampled */
-		if (delta_us < (s64)(sampling_rate / 2))
-			return false;
-		else
-			shared->time_stamp = time_now;
-	}
-
-	return true;
-}
 
 static void dbs_work_handler(struct work_struct *work)
 {
@@ -234,56 +210,69 @@ static void dbs_work_handler(struct work
 					cpu_common_dbs_info, work);
 	struct cpufreq_policy *policy;
 	struct dbs_data *dbs_data;
-	unsigned int sampling_rate, delay;
-	bool eval_load;
+	unsigned int delay;
 
 	policy = shared->policy;
 	dbs_data = policy->governor_data;
 
-	/* Kill all timers */
-	gov_cancel_timers(policy);
-
-	if (dbs_data->cdata->governor == GOV_CONSERVATIVE) {
-		struct cs_dbs_tuners *cs_tuners = dbs_data->tuners;
-
-		sampling_rate = cs_tuners->sampling_rate;
-	} else {
-		struct od_dbs_tuners *od_tuners = dbs_data->tuners;
-
-		sampling_rate = od_tuners->sampling_rate;
-	}
-
-	eval_load = need_load_eval(shared, sampling_rate);
-
 	/*
-	 * Make sure cpufreq_governor_limits() isn't evaluating load in
-	 * parallel.
+	 * Make sure cpufreq_governor_limits() isn't evaluating load or the
+	 * ondemand governor isn't updating the sampling rate in parallel.
 	 */
 	mutex_lock(&shared->timer_mutex);
-	delay = dbs_data->cdata->gov_dbs_timer(policy, eval_load);
+	delay = dbs_data->cdata->gov_dbs_timer(policy);
+	shared->sample_delay_ns = jiffies_to_nsecs(delay);
 	mutex_unlock(&shared->timer_mutex);
 
+	/*
+	 * If the atomic operation below is reordered with respect to the
+	 * sample delay modification, the utilization update handler may end
+	 * up using a stale sample delay value.
+	 */
+	smp_mb__before_atomic();
 	atomic_dec(&shared->skip_work);
+}
+
+static void dbs_irq_work(struct irq_work *irq_work)
+{
+	struct cpu_common_dbs_info *shared;
+
+	shared = container_of(irq_work, struct cpu_common_dbs_info, irq_work);
+	schedule_work(&shared->work);
+}
 
-	gov_add_timers(policy, delay);
+static inline void gov_queue_irq_work(struct cpu_common_dbs_info *shared)
+{
+	if (IS_ENABLED(CONFIG_SMP))
+		irq_work_queue_on(&shared->irq_work, smp_processor_id());
+	else
+		irq_work_queue(&shared->irq_work);
 }
 
-static void dbs_timer_handler(unsigned long data)
+static void dbs_update_util_handler(struct update_util_data *data, u64 time,
+				    unsigned long util, unsigned long max)
 {
-	struct cpu_dbs_info *cdbs = (struct cpu_dbs_info *)data;
+	struct cpu_dbs_info *cdbs = container_of(data, struct cpu_dbs_info, update_util);
 	struct cpu_common_dbs_info *shared = cdbs->shared;
 
 	/*
-	 * Timer handler may not be allowed to queue the work at the moment,
-	 * because:
-	 * - Another timer handler has done that
-	 * - We are stopping the governor
-	 * - Or we are updating the sampling rate of the ondemand governor
+	 * The work may not be allowed to be queued up right now.
+	 * Possible reasons:
+	 * - Work has already been queued up or is in progress.
+	 * - The governor is being stopped.
+	 * - It is too early (too little time from the previous sample).
 	 */
-	if (atomic_inc_return(&shared->skip_work) > 1)
-		atomic_dec(&shared->skip_work);
-	else
-		queue_work(system_wq, &shared->work);
+	if (atomic_inc_return(&shared->skip_work) == 1) {
+		u64 delta_ns;
+
+		delta_ns = time - shared->last_sample_time;
+		if ((s64)delta_ns >= shared->sample_delay_ns) {
+			shared->last_sample_time = time;
+			gov_queue_irq_work(shared);
+			return;
+		}
+	}
+	atomic_dec(&shared->skip_work);
 }
 
 static void set_sampling_rate(struct dbs_data *dbs_data,
@@ -467,9 +456,6 @@ static int cpufreq_governor_start(struct
 		io_busy = od_tuners->io_is_busy;
 	}
 
-	shared->policy = policy;
-	shared->time_stamp = ktime_get();
-
 	for_each_cpu(j, policy->cpus) {
 		struct cpu_dbs_info *j_cdbs = cdata->get_cpu_cdbs(j);
 		unsigned int prev_load;
@@ -485,10 +471,10 @@ static int cpufreq_governor_start(struct
 		if (ignore_nice)
 			j_cdbs->prev_cpu_nice = kcpustat_cpu(j).cpustat[CPUTIME_NICE];
 
-		__setup_timer(&j_cdbs->timer, dbs_timer_handler,
-			      (unsigned long)j_cdbs,
-			      TIMER_DEFERRABLE | TIMER_IRQSAFE);
+		j_cdbs->update_util.func = dbs_update_util_handler;
 	}
+	shared->policy = policy;
+	init_irq_work(&shared->irq_work, dbs_irq_work);
 
 	if (cdata->governor == GOV_CONSERVATIVE) {
 		struct cs_cpu_dbs_info_s *cs_dbs_info =
@@ -505,7 +491,7 @@ static int cpufreq_governor_start(struct
 		od_ops->powersave_bias_init_cpu(cpu);
 	}
 
-	gov_add_timers(policy, delay_for_sampling_rate(sampling_rate));
+	gov_set_update_util(shared, sampling_rate);
 	return 0;
 }
 
Index: linux-pm/drivers/cpufreq/cpufreq_ondemand.c
===================================================================
--- linux-pm.orig/drivers/cpufreq/cpufreq_ondemand.c
+++ linux-pm/drivers/cpufreq/cpufreq_ondemand.c
@@ -191,7 +191,7 @@ static void od_check_cpu(int cpu, unsign
 	}
 }
 
-static unsigned int od_dbs_timer(struct cpufreq_policy *policy, bool modify_all)
+static unsigned int od_dbs_timer(struct cpufreq_policy *policy)
 {
 	struct dbs_data *dbs_data = policy->governor_data;
 	unsigned int cpu = policy->cpu;
@@ -200,9 +200,6 @@ static unsigned int od_dbs_timer(struct
 	struct od_dbs_tuners *od_tuners = dbs_data->tuners;
 	int delay = 0, sample_type = dbs_info->sample_type;
 
-	if (!modify_all)
-		goto max_delay;
-
 	/* Common NORMAL_SAMPLE setup */
 	dbs_info->sample_type = OD_NORMAL_SAMPLE;
 	if (sample_type == OD_SUB_SAMPLE) {
@@ -218,7 +215,6 @@ static unsigned int od_dbs_timer(struct
 		}
 	}
 
-max_delay:
 	if (!delay)
 		delay = delay_for_sampling_rate(od_tuners->sampling_rate
 				* dbs_info->rate_mult);
@@ -264,7 +260,6 @@ static void update_sampling_rate(struct
 		struct od_cpu_dbs_info_s *dbs_info;
 		struct cpu_dbs_info *cdbs;
 		struct cpu_common_dbs_info *shared;
-		unsigned long next_sampling, appointed_at;
 
 		dbs_info = &per_cpu(od_cpu_dbs_info, cpu);
 		cdbs = &dbs_info->cdbs;
@@ -288,20 +283,28 @@ static void update_sampling_rate(struct
 		 * policy will be governed by dbs_data, otherwise there can be
 		 * multiple policies that are governed by the same dbs_data.
 		 */
-		if (dbs_data != policy->governor_data)
-			continue;
-
-		/*
-		 * Checking this for any CPU should be fine, timers for all of
-		 * them are scheduled together.
-		 */
-		next_sampling = jiffies + usecs_to_jiffies(new_rate);
-		appointed_at = dbs_info->cdbs.timer.expires;
-
-		if (time_before(next_sampling, appointed_at)) {
-			gov_cancel_work(shared);
-			gov_add_timers(policy, usecs_to_jiffies(new_rate));
-
+		if (dbs_data == policy->governor_data) {
+			mutex_lock(&shared->timer_mutex);
+			/*
+			 * On 32-bit architectures this may race with the
+			 * sample_delay_ns read in dbs_update_util_handler(),
+			 * but that really doesn't matter.  If the read returns
+			 * a value that's too big, the sample will be skipped,
+			 * but the next invocation of dbs_update_util_handler()
+			 * (when the update has been completed) will take a
+			 * sample.  If the returned value is too small, the
+			 * sample will be taken immediately, but that isn't a
+			 * problem, as we want the new rate to take effect
+			 * immediately anyway.
+			 *
+			 * If this runs in parallel with dbs_work_handler(), we
+			 * may end up overwriting the sample_delay_ns value that
+			 * it has just written, but the difference should not be
+			 * too big and it will be corrected next time a sample
+			 * is taken, so it shouldn't be significant.
+			 */
+			gov_update_sample_delay(shared, new_rate);
+			mutex_unlock(&shared->timer_mutex);
 		}
 	}
 
Index: linux-pm/drivers/cpufreq/cpufreq_conservative.c
===================================================================
--- linux-pm.orig/drivers/cpufreq/cpufreq_conservative.c
+++ linux-pm/drivers/cpufreq/cpufreq_conservative.c
@@ -115,14 +115,12 @@ static void cs_check_cpu(int cpu, unsign
 	}
 }
 
-static unsigned int cs_dbs_timer(struct cpufreq_policy *policy, bool modify_all)
+static unsigned int cs_dbs_timer(struct cpufreq_policy *policy)
 {
 	struct dbs_data *dbs_data = policy->governor_data;
 	struct cs_dbs_tuners *cs_tuners = dbs_data->tuners;
 
-	if (modify_all)
-		dbs_check_cpu(dbs_data, policy->cpu);
-
+	dbs_check_cpu(dbs_data, policy->cpu);
 	return delay_for_sampling_rate(cs_tuners->sampling_rate);
 }
 

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH 3/3 v3] cpufreq: governor: Replace timers with utilization update callbacks
  2016-02-05 23:10             ` Rafael J. Wysocki
@ 2016-02-07  9:10               ` Viresh Kumar
  2016-02-07 14:43                 ` Rafael J. Wysocki
  0 siblings, 1 reply; 134+ messages in thread
From: Viresh Kumar @ 2016-02-07  9:10 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Rafael J. Wysocki, Linux PM list, Linux Kernel Mailing List,
	Peter Zijlstra, Srinivas Pandruvada, Juri Lelli, Steve Muckle,
	Thomas Gleixner

On 06-02-16, 00:10, Rafael J. Wysocki wrote:
> On Friday, February 05, 2016 08:17:56 PM Viresh Kumar wrote:
> > Okay, how about this then.
> > 
> > We do some computations here and, based on them, conditionally want to
> > update sample_delay_ns. Because there is no penalty now, in terms of
> > removing/adding timers/wq, etc., why shouldn't we simply update
> > sample_delay_ns every time, without any checks? That would mean that a
> > change of the sampling rate takes effect immediately, and what can be
> > better than that?
> 
> Yes, we can do that.
> 
> There is a small concern about updating in parallel with dbs_work_handler()
> in which case we may overwrite the (hopefully already correct) sample_delay_ns
> value that it has just written, but then it will be corrected next time we
> take a sample, so it shouldn't be a big deal.
> 
> OK, I'll update the patch to do that.

Great.

> > Also, we should do the same from update-sampling-rate of conservative
> > governor as well.
> 
> Let's just not change the whole world in one patch, OK?

Yeah, I wasn't asking you to update it in the same patch, just that we
should do that as well.

> > I did a bit of that this morning, and there weren't any serious issues
> > as far as I could see :)
> 
> The case I'm mostly concerned about is when update_sampling_rate() looks
> at a CPU with a policy completely unrelated to the dbs_data it was called
> for.  In that case the "shared" object may, in theory, just go away from
> under it at any time while it is looking at that object.

Right, one way (of course we should try to find something better) is to move
that update to a separate work item, just as I did in my patch.

But, I am quite sure we can get that fixed.

-- 
viresh

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH 3/3 v4] cpufreq: governor: Replace timers with utilization update callbacks
  2016-02-06  3:40       ` [PATCH 3/3 v4] " Rafael J. Wysocki
@ 2016-02-07  9:20         ` Viresh Kumar
  2016-02-07 14:36           ` Rafael J. Wysocki
  2016-02-07 14:50         ` [PATCH 3/3 v5] " Rafael J. Wysocki
  1 sibling, 1 reply; 134+ messages in thread
From: Viresh Kumar @ 2016-02-07  9:20 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Linux PM list, Linux Kernel Mailing List, Peter Zijlstra,
	Srinivas Pandruvada, Juri Lelli, Steve Muckle, Thomas Gleixner

On 06-02-16, 04:40, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> 
> Instead of using a per-CPU deferrable timer for queuing up governor
> work items, register a utilization update callback that will be
> invoked from the scheduler on utilization changes.
> 
> The sampling rate is still the same as what was used for the
> deferrable timers and the added irq_work overhead should be offset by
> the eliminated timers overhead, so in theory the functional impact of
> this patch should not be significant.
> 
> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> ---
> 
> Updated after the recent discussion with Viresh.
> 
> Changes from v3:
> - The completion used for irq_work synchronization replaced with irq_work_sync()
>   in gov_cancel_work().
> - update_sampling_rate() now modifies shared->sample_delay_ns for all CPUs
>   where it matters directly with a big fat comment explaining why this is
>   actually OK.
> - The above means the time_stamp field in struct cpu_common_dbs_info is not
>   necessary any more, so it is dropped.
> - A build error for !CONFIG_SMP is addressed (hopefully effectively).
> 
> This version was lightly tested on an x86 laptop.

Awesome work Rafael, this looks really good now.

Acked-by: Viresh Kumar <viresh.kumar@linaro.org>

-- 
viresh

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH 3/3 v4] cpufreq: governor: Replace timers with utilization update callbacks
  2016-02-07  9:20         ` Viresh Kumar
@ 2016-02-07 14:36           ` Rafael J. Wysocki
  0 siblings, 0 replies; 134+ messages in thread
From: Rafael J. Wysocki @ 2016-02-07 14:36 UTC (permalink / raw)
  To: Viresh Kumar
  Cc: Linux PM list, Linux Kernel Mailing List, Peter Zijlstra,
	Srinivas Pandruvada, Juri Lelli, Steve Muckle, Thomas Gleixner

On Sunday, February 07, 2016 02:50:19 PM Viresh Kumar wrote:
> On 06-02-16, 04:40, Rafael J. Wysocki wrote:
> > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> > 
> > Instead of using a per-CPU deferrable timer for queuing up governor
> > work items, register a utilization update callback that will be
> > invoked from the scheduler on utilization changes.
> > 
> > The sampling rate is still the same as what was used for the
> > deferrable timers and the added irq_work overhead should be offset by
> > the eliminated timers overhead, so in theory the functional impact of
> > this patch should not be significant.
> > 
> > Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> > ---
> > 
> > Updated after the recent discussion with Viresh.
> > 
> > Changes from v3:
> > - The completion used for irq_work synchronization replaced with irq_work_sync()
> >   in gov_cancel_work().
> > - update_sampling_rate() now modifies shared->sample_delay_ns for all CPUs
> >   where it matters directly with a big fat comment explaining why this is
> >   actually OK.
> > - The above means the time_stamp field in struct cpu_common_dbs_info is not
> >   necessary any more, so it is dropped.
> > - A build error for !CONFIG_SMP is addressed (hopefully effectively).
> > 
> > This version was lightly tested on an x86 laptop.
> 
> Awesome work Rafael, this looks really good now.
> 
> Acked-by: Viresh Kumar <viresh.kumar@linaro.org>

Thanks!

I have one small update, though.  Namely, it is more logical to initialize
irq_work along with doing INIT_WORK() on the main work item.
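
That is, alloc_common_dbs_info() will do all of the initialization in one
place (essentially the only code delta from v4, as it ends up in v5 below):

mutex_init(&shared->timer_mutex);
atomic_set(&shared->skip_work, 0);
init_irq_work(&shared->irq_work, dbs_irq_work);
INIT_WORK(&shared->work, dbs_work_handler);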

I'll send it in a while.

Thanks,
Rafael

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH 3/3 v3] cpufreq: governor: Replace timers with utilization update callbacks
  2016-02-07  9:10               ` Viresh Kumar
@ 2016-02-07 14:43                 ` Rafael J. Wysocki
  2016-02-08  2:08                   ` Rafael J. Wysocki
  0 siblings, 1 reply; 134+ messages in thread
From: Rafael J. Wysocki @ 2016-02-07 14:43 UTC (permalink / raw)
  To: Viresh Kumar
  Cc: Rafael J. Wysocki, Linux PM list, Linux Kernel Mailing List,
	Peter Zijlstra, Srinivas Pandruvada, Juri Lelli, Steve Muckle,
	Thomas Gleixner

On Sunday, February 07, 2016 02:40:40 PM Viresh Kumar wrote:
> On 06-02-16, 00:10, Rafael J. Wysocki wrote:
> > On Friday, February 05, 2016 08:17:56 PM Viresh Kumar wrote:
> > > Okay, how about this then.
> > > 
> > > We do some computations here and, based on them, conditionally want to
> > > update sample_delay_ns. Because there is no penalty now, in terms of
> > > removing/adding timers/wq, etc., why shouldn't we simply update
> > > sample_delay_ns every time, without any checks? That would mean that a
> > > change of the sampling rate takes effect immediately, and what can be
> > > better than that?
> > 
> > Yes, we can do that.
> > 
> > There is a small concern about updating in parallel with dbs_work_handler()
> > in which case we may overwrite the (hopefully already correct) sample_delay_ns
> > value that it has just written, but then it will be corrected next time we
> > take a sample, so it shouldn't be a big deal.
> > 
> > OK, I'll update the patch to do that.
> 
> Great.
> 
> > > Also, we should do the same from update-sampling-rate of conservative
> > > governor as well.
> > 
> > Let's just not change the whole world in one patch, OK?
> 
> Yeah, I wasn't asking you to update it in the same patch, just that we
> should do that as well.
> 
> > > I did a bit of that this morning, and there weren't any serious issues
> > > as far as I could see :)
> > 
> > The case I'm mostly concerned about is when update_sampling_rate() looks
> > at a CPU with a policy completely unrelated to the dbs_data it was called
> > for.  In that case the "shared" object may, in theory, just go away from
> > under it at any time while it is looking at that object.
> 
> Right, one way (of course we should try to find something better) is to move
> that update to a separate work item, just as I did in my patch.

No, it isn't.  Trying to do it asynchronously will only lead to more
concurrency-related issues.

> But, I am quite sure we can get that fixed.

What we need to do is make it possible for update_sampling_rate()
to walk all of the cpu_dbs_infos and safely look at what their policy_dbs
fields point to.

After my cleanup patches it does that under dbs_data_mutex and that works,
because this mutex is also held around *any* updates of struct cpu_dbs_info
anywhere.

However, the cpu_dbs_infos themselves are actually static, so they can be
accessed at any time.  It looks like, then, we may just need to add a lock to
each of them to ensure that the policy_dbs thing won't go away suddenly, and
then we may not need dbs_data_mutex in there any more.
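
Roughly something like this (just a sketch of the idea, with the pre-cleanup
field names and a made-up lock):

struct cpu_dbs_info {
	struct cpu_common_dbs_info *shared;	/* may be cleared on stop */
	struct mutex shared_mutex;		/* guards ->shared */
	/* ... */
};

/* What update_sampling_rate() could then do for each CPU: */
static void update_one(struct cpu_dbs_info *cdbs,
		       struct dbs_data *dbs_data, unsigned int new_rate)
{
	struct cpu_common_dbs_info *shared;

	mutex_lock(&cdbs->shared_mutex);
	shared = cdbs->shared;
	if (shared && shared->policy->governor_data == dbs_data)
		gov_update_sample_delay(shared, new_rate);
	mutex_unlock(&cdbs->shared_mutex);
}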

Thanks,
Rafael

^ permalink raw reply	[flat|nested] 134+ messages in thread

* [PATCH 3/3 v5] cpufreq: governor: Replace timers with utilization update callbacks
  2016-02-06  3:40       ` [PATCH 3/3 v4] " Rafael J. Wysocki
  2016-02-07  9:20         ` Viresh Kumar
@ 2016-02-07 14:50         ` Rafael J. Wysocki
  2016-02-07 15:36           ` Viresh Kumar
  2016-02-09 10:01           ` Gautham R Shenoy
  1 sibling, 2 replies; 134+ messages in thread
From: Rafael J. Wysocki @ 2016-02-07 14:50 UTC (permalink / raw)
  To: Linux PM list
  Cc: Linux Kernel Mailing List, Peter Zijlstra, Srinivas Pandruvada,
	Viresh Kumar, Juri Lelli, Steve Muckle, Thomas Gleixner

From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

Instead of using a per-CPU deferrable timer for queuing up governor
work items, register a utilization update callback that will be
invoked from the scheduler on utilization changes.

The sampling rate is still the same as what was used for the
deferrable timers and the added irq_work overhead should be offset by
the eliminated timers overhead, so in theory the functional impact of
this patch should not be significant.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
---

Changes from v4:
- Moved init_irq_work() to alloc_common_dbs_info() so it is done along with
  the INIT_WORK() on the main work structure (which seems more logical to me).
- Added the ACK from Viresh (in the hope that it still applied).

Thanks,
Rafael

---
 drivers/cpufreq/cpufreq_conservative.c |    6 -
 drivers/cpufreq/cpufreq_governor.c     |  164 +++++++++++++++------------------
 drivers/cpufreq/cpufreq_governor.h     |   19 ++-
 drivers/cpufreq/cpufreq_ondemand.c     |   43 ++++----
 4 files changed, 112 insertions(+), 120 deletions(-)

Index: linux-pm/drivers/cpufreq/cpufreq_governor.h
===================================================================
--- linux-pm.orig/drivers/cpufreq/cpufreq_governor.h
+++ linux-pm/drivers/cpufreq/cpufreq_governor.h
@@ -18,6 +18,7 @@
 #define _CPUFREQ_GOVERNOR_H
 
 #include <linux/atomic.h>
+#include <linux/irq_work.h>
 #include <linux/cpufreq.h>
 #include <linux/kernel_stat.h>
 #include <linux/module.h>
@@ -138,11 +139,19 @@ struct cpu_common_dbs_info {
 	 */
 	struct mutex timer_mutex;
 
-	ktime_t time_stamp;
+	u64 last_sample_time;
+	s64 sample_delay_ns;
 	atomic_t skip_work;
+	struct irq_work irq_work;
 	struct work_struct work;
 };
 
+static inline void gov_update_sample_delay(struct cpu_common_dbs_info *shared,
+					   unsigned int delay_us)
+{
+	shared->sample_delay_ns = delay_us * NSEC_PER_USEC;
+}
+
 /* Per cpu structures */
 struct cpu_dbs_info {
 	u64 prev_cpu_idle;
@@ -155,7 +164,7 @@ struct cpu_dbs_info {
 	 * wake-up from idle.
 	 */
 	unsigned int prev_load;
-	struct timer_list timer;
+	struct update_util_data update_util;
 	struct cpu_common_dbs_info *shared;
 };
 
@@ -212,8 +221,7 @@ struct common_dbs_data {
 
 	struct cpu_dbs_info *(*get_cpu_cdbs)(int cpu);
 	void *(*get_cpu_dbs_info_s)(int cpu);
-	unsigned int (*gov_dbs_timer)(struct cpufreq_policy *policy,
-				      bool modify_all);
+	unsigned int (*gov_dbs_timer)(struct cpufreq_policy *policy);
 	void (*gov_check_cpu)(int cpu, unsigned int load);
 	int (*init)(struct dbs_data *dbs_data, bool notify);
 	void (*exit)(struct dbs_data *dbs_data, bool notify);
@@ -270,9 +278,6 @@ static ssize_t show_sampling_rate_min_go
 }
 
 extern struct mutex cpufreq_governor_lock;
-
-void gov_add_timers(struct cpufreq_policy *policy, unsigned int delay);
-void gov_cancel_work(struct cpu_common_dbs_info *shared);
 void dbs_check_cpu(struct dbs_data *dbs_data, int cpu);
 int cpufreq_governor_dbs(struct cpufreq_policy *policy,
 		struct common_dbs_data *cdata, unsigned int event);
Index: linux-pm/drivers/cpufreq/cpufreq_governor.c
===================================================================
--- linux-pm.orig/drivers/cpufreq/cpufreq_governor.c
+++ linux-pm/drivers/cpufreq/cpufreq_governor.c
@@ -128,10 +128,10 @@ void dbs_check_cpu(struct dbs_data *dbs_
 		 * dropped down. So we perform the copy only once, upon the
 		 * first wake-up from idle.)
 		 *
-		 * Detecting this situation is easy: the governor's deferrable
-		 * timer would not have fired during CPU-idle periods. Hence
-		 * an unusually large 'wall_time' (as compared to the sampling
-		 * rate) indicates this scenario.
+		 * Detecting this situation is easy: the governor's utilization
+		 * update handler would not have run during CPU-idle periods.
+		 * Hence, an unusually large 'wall_time' (as compared to the
+		 * sampling rate) indicates this scenario.
 		 *
 		 * prev_load can be zero in two cases and we must recalculate it
 		 * for both cases:
@@ -161,72 +161,48 @@ void dbs_check_cpu(struct dbs_data *dbs_
 }
 EXPORT_SYMBOL_GPL(dbs_check_cpu);
 
-void gov_add_timers(struct cpufreq_policy *policy, unsigned int delay)
+void gov_set_update_util(struct cpu_common_dbs_info *shared,
+			 unsigned int delay_us)
 {
+	struct cpufreq_policy *policy = shared->policy;
 	struct dbs_data *dbs_data = policy->governor_data;
-	struct cpu_dbs_info *cdbs;
 	int cpu;
 
+	gov_update_sample_delay(shared, delay_us);
+	shared->last_sample_time = 0;
+
 	for_each_cpu(cpu, policy->cpus) {
-		cdbs = dbs_data->cdata->get_cpu_cdbs(cpu);
-		cdbs->timer.expires = jiffies + delay;
-		add_timer_on(&cdbs->timer, cpu);
+		struct cpu_dbs_info *cdbs = dbs_data->cdata->get_cpu_cdbs(cpu);
+
+		cpufreq_set_update_util_data(cpu, &cdbs->update_util);
 	}
 }
-EXPORT_SYMBOL_GPL(gov_add_timers);
+EXPORT_SYMBOL_GPL(gov_set_update_util);
 
-static inline void gov_cancel_timers(struct cpufreq_policy *policy)
+static inline void gov_clear_update_util(struct cpufreq_policy *policy)
 {
-	struct dbs_data *dbs_data = policy->governor_data;
-	struct cpu_dbs_info *cdbs;
 	int i;
 
-	for_each_cpu(i, policy->cpus) {
-		cdbs = dbs_data->cdata->get_cpu_cdbs(i);
-		del_timer_sync(&cdbs->timer);
-	}
+	for_each_cpu(i, policy->cpus)
+		cpufreq_set_update_util_data(i, NULL);
+
+	synchronize_rcu();
 }
 
-void gov_cancel_work(struct cpu_common_dbs_info *shared)
+static void gov_cancel_work(struct cpu_common_dbs_info *shared)
 {
-	/* Tell dbs_timer_handler() to skip queuing up work items. */
+	/* Tell dbs_update_util_handler() to skip queuing up work items. */
 	atomic_inc(&shared->skip_work);
 	/*
-	 * If dbs_timer_handler() is already running, it may not notice the
-	 * incremented skip_work, so wait for it to complete to prevent its work
-	 * item from being queued up after the cancel_work_sync() below.
-	 */
-	gov_cancel_timers(shared->policy);
-	/*
-	 * In case dbs_timer_handler() managed to run and spawn a work item
-	 * before the timers have been canceled, wait for that work item to
-	 * complete and then cancel all of the timers set up by it.  If
-	 * dbs_timer_handler() runs again at that point, it will see the
-	 * positive value of skip_work and won't spawn any more work items.
+	 * If dbs_update_util_handler() is already running, it may not notice
+	 * the incremented skip_work, so wait for it to complete to prevent its
+	 * work item from being queued up after the cancel_work_sync() below.
 	 */
+	gov_clear_update_util(shared->policy);
+	irq_work_sync(&shared->irq_work);
 	cancel_work_sync(&shared->work);
-	gov_cancel_timers(shared->policy);
 	atomic_set(&shared->skip_work, 0);
 }
-EXPORT_SYMBOL_GPL(gov_cancel_work);
-
-/* Will return if we need to evaluate cpu load again or not */
-static bool need_load_eval(struct cpu_common_dbs_info *shared,
-			   unsigned int sampling_rate)
-{
-	if (policy_is_shared(shared->policy)) {
-		ktime_t time_now = ktime_get();
-		s64 delta_us = ktime_us_delta(time_now, shared->time_stamp);
-
-		/* Do nothing if we recently have sampled */
-		if (delta_us < (s64)(sampling_rate / 2))
-			return false;
-		else
-			shared->time_stamp = time_now;
-	}
-
-	return true;
-}
 
 static void dbs_work_handler(struct work_struct *work)
 {
@@ -234,56 +210,69 @@ static void dbs_work_handler(struct work
 					cpu_common_dbs_info, work);
 	struct cpufreq_policy *policy;
 	struct dbs_data *dbs_data;
-	unsigned int sampling_rate, delay;
-	bool eval_load;
+	unsigned int delay;
 
 	policy = shared->policy;
 	dbs_data = policy->governor_data;
 
-	/* Kill all timers */
-	gov_cancel_timers(policy);
-
-	if (dbs_data->cdata->governor == GOV_CONSERVATIVE) {
-		struct cs_dbs_tuners *cs_tuners = dbs_data->tuners;
-
-		sampling_rate = cs_tuners->sampling_rate;
-	} else {
-		struct od_dbs_tuners *od_tuners = dbs_data->tuners;
-
-		sampling_rate = od_tuners->sampling_rate;
-	}
-
-	eval_load = need_load_eval(shared, sampling_rate);
-
 	/*
-	 * Make sure cpufreq_governor_limits() isn't evaluating load in
-	 * parallel.
+	 * Make sure cpufreq_governor_limits() isn't evaluating load or the
+	 * ondemand governor isn't updating the sampling rate in parallel.
 	 */
 	mutex_lock(&shared->timer_mutex);
-	delay = dbs_data->cdata->gov_dbs_timer(policy, eval_load);
+	delay = dbs_data->cdata->gov_dbs_timer(policy);
+	shared->sample_delay_ns = jiffies_to_nsecs(delay);
 	mutex_unlock(&shared->timer_mutex);
 
+	/*
+	 * If the atomic operation below is reordered with respect to the
+	 * sample delay modification, the utilization update handler may end
+	 * up using a stale sample delay value.
+	 */
+	smp_mb__before_atomic();
 	atomic_dec(&shared->skip_work);
+}
+
+static void dbs_irq_work(struct irq_work *irq_work)
+{
+	struct cpu_common_dbs_info *shared;
+
+	shared = container_of(irq_work, struct cpu_common_dbs_info, irq_work);
+	schedule_work(&shared->work);
+}
 
-	gov_add_timers(policy, delay);
+static inline void gov_queue_irq_work(struct cpu_common_dbs_info *shared)
+{
+	if (IS_ENABLED(CONFIG_SMP))
+		irq_work_queue_on(&shared->irq_work, smp_processor_id());
+	else
+		irq_work_queue(&shared->irq_work);
 }
 
-static void dbs_timer_handler(unsigned long data)
+static void dbs_update_util_handler(struct update_util_data *data, u64 time,
+				    unsigned long util, unsigned long max)
 {
-	struct cpu_dbs_info *cdbs = (struct cpu_dbs_info *)data;
+	struct cpu_dbs_info *cdbs = container_of(data, struct cpu_dbs_info, update_util);
 	struct cpu_common_dbs_info *shared = cdbs->shared;
 
 	/*
-	 * Timer handler may not be allowed to queue the work at the moment,
-	 * because:
-	 * - Another timer handler has done that
-	 * - We are stopping the governor
-	 * - Or we are updating the sampling rate of the ondemand governor
+	 * The work may not be allowed to be queued up right now.
+	 * Possible reasons:
+	 * - Work has already been queued up or is in progress.
+	 * - The governor is being stopped.
+	 * - It is too early (too little time from the previous sample).
 	 */
-	if (atomic_inc_return(&shared->skip_work) > 1)
-		atomic_dec(&shared->skip_work);
-	else
-		queue_work(system_wq, &shared->work);
+	if (atomic_inc_return(&shared->skip_work) == 1) {
+		u64 delta_ns;
+
+		delta_ns = time - shared->last_sample_time;
+		if ((s64)delta_ns >= shared->sample_delay_ns) {
+			shared->last_sample_time = time;
+			gov_queue_irq_work(shared);
+			return;
+		}
+	}
+	atomic_dec(&shared->skip_work);
 }
 
 static void set_sampling_rate(struct dbs_data *dbs_data,
@@ -315,6 +304,7 @@ static int alloc_common_dbs_info(struct
 
 	mutex_init(&shared->timer_mutex);
 	atomic_set(&shared->skip_work, 0);
+	init_irq_work(&shared->irq_work, dbs_irq_work);
 	INIT_WORK(&shared->work, dbs_work_handler);
 	return 0;
 }
@@ -467,9 +457,6 @@ static int cpufreq_governor_start(struct
 		io_busy = od_tuners->io_is_busy;
 	}
 
-	shared->policy = policy;
-	shared->time_stamp = ktime_get();
-
 	for_each_cpu(j, policy->cpus) {
 		struct cpu_dbs_info *j_cdbs = cdata->get_cpu_cdbs(j);
 		unsigned int prev_load;
@@ -485,10 +472,9 @@ static int cpufreq_governor_start(struct
 		if (ignore_nice)
 			j_cdbs->prev_cpu_nice = kcpustat_cpu(j).cpustat[CPUTIME_NICE];
 
-		__setup_timer(&j_cdbs->timer, dbs_timer_handler,
-			      (unsigned long)j_cdbs,
-			      TIMER_DEFERRABLE | TIMER_IRQSAFE);
+		j_cdbs->update_util.func = dbs_update_util_handler;
 	}
+	shared->policy = policy;
 
 	if (cdata->governor == GOV_CONSERVATIVE) {
 		struct cs_cpu_dbs_info_s *cs_dbs_info =
@@ -505,7 +491,7 @@ static int cpufreq_governor_start(struct
 		od_ops->powersave_bias_init_cpu(cpu);
 	}
 
-	gov_add_timers(policy, delay_for_sampling_rate(sampling_rate));
+	gov_set_update_util(shared, sampling_rate);
 	return 0;
 }
 
Index: linux-pm/drivers/cpufreq/cpufreq_ondemand.c
===================================================================
--- linux-pm.orig/drivers/cpufreq/cpufreq_ondemand.c
+++ linux-pm/drivers/cpufreq/cpufreq_ondemand.c
@@ -191,7 +191,7 @@ static void od_check_cpu(int cpu, unsign
 	}
 }
 
-static unsigned int od_dbs_timer(struct cpufreq_policy *policy, bool modify_all)
+static unsigned int od_dbs_timer(struct cpufreq_policy *policy)
 {
 	struct dbs_data *dbs_data = policy->governor_data;
 	unsigned int cpu = policy->cpu;
@@ -200,9 +200,6 @@ static unsigned int od_dbs_timer(struct
 	struct od_dbs_tuners *od_tuners = dbs_data->tuners;
 	int delay = 0, sample_type = dbs_info->sample_type;
 
-	if (!modify_all)
-		goto max_delay;
-
 	/* Common NORMAL_SAMPLE setup */
 	dbs_info->sample_type = OD_NORMAL_SAMPLE;
 	if (sample_type == OD_SUB_SAMPLE) {
@@ -218,7 +215,6 @@ static unsigned int od_dbs_timer(struct
 		}
 	}
 
-max_delay:
 	if (!delay)
 		delay = delay_for_sampling_rate(od_tuners->sampling_rate
 				* dbs_info->rate_mult);
@@ -264,7 +260,6 @@ static void update_sampling_rate(struct
 		struct od_cpu_dbs_info_s *dbs_info;
 		struct cpu_dbs_info *cdbs;
 		struct cpu_common_dbs_info *shared;
-		unsigned long next_sampling, appointed_at;
 
 		dbs_info = &per_cpu(od_cpu_dbs_info, cpu);
 		cdbs = &dbs_info->cdbs;
@@ -288,20 +283,28 @@ static void update_sampling_rate(struct
 		 * policy will be governed by dbs_data, otherwise there can be
 		 * multiple policies that are governed by the same dbs_data.
 		 */
-		if (dbs_data != policy->governor_data)
-			continue;
-
-		/*
-		 * Checking this for any CPU should be fine, timers for all of
-		 * them are scheduled together.
-		 */
-		next_sampling = jiffies + usecs_to_jiffies(new_rate);
-		appointed_at = dbs_info->cdbs.timer.expires;
-
-		if (time_before(next_sampling, appointed_at)) {
-			gov_cancel_work(shared);
-			gov_add_timers(policy, usecs_to_jiffies(new_rate));
-
+		if (dbs_data == policy->governor_data) {
+			mutex_lock(&shared->timer_mutex);
+			/*
+			 * On 32-bit architectures this may race with the
+			 * sample_delay_ns read in dbs_update_util_handler(),
+			 * but that really doesn't matter.  If the read returns
+			 * a value that's too big, the sample will be skipped,
+			 * but the next invocation of dbs_update_util_handler()
+			 * (when the update has been completed) will take a
+			 * sample.  If the returned value is too small, the
+			 * sample will be taken immediately, but that isn't a
+			 * problem, as we want the new rate to take effect
+			 * immediately anyway.
+			 *
+			 * If this runs in parallel with dbs_work_handler(), we
+			 * may end up overwriting the sample_delay_ns value that
+			 * it has just written, but the difference should not be
+			 * too big and it will be corrected next time a sample
+			 * is taken, so it shouldn't be significant.
+			 */
+			gov_update_sample_delay(shared, new_rate);
+			mutex_unlock(&shared->timer_mutex);
 		}
 	}
 
Index: linux-pm/drivers/cpufreq/cpufreq_conservative.c
===================================================================
--- linux-pm.orig/drivers/cpufreq/cpufreq_conservative.c
+++ linux-pm/drivers/cpufreq/cpufreq_conservative.c
@@ -115,14 +115,12 @@ static void cs_check_cpu(int cpu, unsign
 	}
 }
 
-static unsigned int cs_dbs_timer(struct cpufreq_policy *policy, bool modify_all)
+static unsigned int cs_dbs_timer(struct cpufreq_policy *policy)
 {
 	struct dbs_data *dbs_data = policy->governor_data;
 	struct cs_dbs_tuners *cs_tuners = dbs_data->tuners;
 
-	if (modify_all)
-		dbs_check_cpu(dbs_data, policy->cpu);
-
+	dbs_check_cpu(dbs_data, policy->cpu);
 	return delay_for_sampling_rate(cs_tuners->sampling_rate);
 }
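
For reference, the barrier pairing used in the hunks above boils down to
this pattern (a condensed sketch, not the actual code; atomic_inc_return()
implies a full memory barrier, which is what the reader side relies on):

	/* writer side, dbs_work_handler(): */
	shared->sample_delay_ns = jiffies_to_nsecs(delay);
	smp_mb__before_atomic();	/* order the store above ... */
	atomic_dec(&shared->skip_work);	/* ... before this decrement */

	/* reader side, dbs_update_util_handler(): */
	if (atomic_inc_return(&shared->skip_work) == 1) {
		/* atomic_inc_return() is a full barrier, so this read
		 * cannot observe a sample_delay_ns older than the store
		 * made before the decrement seen by the increment. */
		delta_ns = time - shared->last_sample_time;
		...
	}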
 

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH 3/3 v5] cpufreq: governor: Replace timers with utilization update callbacks
  2016-02-07 14:50         ` [PATCH 3/3 v5] " Rafael J. Wysocki
@ 2016-02-07 15:36           ` Viresh Kumar
  2016-02-09 10:01           ` Gautham R Shenoy
  1 sibling, 0 replies; 134+ messages in thread
From: Viresh Kumar @ 2016-02-07 15:36 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Linux PM list, Linux Kernel Mailing List, Peter Zijlstra,
	Srinivas Pandruvada, Juri Lelli, Steve Muckle, Thomas Gleixner

On 07-02-16, 15:50, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> 
> Instead of using a per-CPU deferrable timer for queuing up governor
> work items, register a utilization update callback that will be
> invoked from the scheduler on utilization changes.
> 
> The sampling rate is still the same as what was used for the
> deferrable timers and the added irq_work overhead should be offset by
> the eliminated timers overhead, so in theory the functional impact of
> this patch should not be significant.
> 
> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
> ---
> 
> Changes from v4:
> - Moved init_irq_work() to alloc_common_dbs_info() so it is done along with
>   the INIT_WORK() on the main work structure (which seems more logical to me).
> - Added the ACK from Viresh (in the hope that it still applied).

Acked-by: Viresh Kumar <viresh.kumar@linaro.org>

-- 
viresh

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH 3/3 v3] cpufreq: governor: Replace timers with utilization update callbacks
  2016-02-07 14:43                 ` Rafael J. Wysocki
@ 2016-02-08  2:08                   ` Rafael J. Wysocki
  2016-02-08 11:52                     ` Viresh Kumar
  0 siblings, 1 reply; 134+ messages in thread
From: Rafael J. Wysocki @ 2016-02-08  2:08 UTC (permalink / raw)
  To: Viresh Kumar
  Cc: Rafael J. Wysocki, Linux PM list, Linux Kernel Mailing List,
	Peter Zijlstra, Srinivas Pandruvada, Juri Lelli, Steve Muckle,
	Thomas Gleixner

On Sunday, February 07, 2016 03:43:20 PM Rafael J. Wysocki wrote:
> On Sunday, February 07, 2016 02:40:40 PM Viresh Kumar wrote:
> > On 06-02-16, 00:10, Rafael J. Wysocki wrote:
> > > On Friday, February 05, 2016 08:17:56 PM Viresh Kumar wrote:
> > > > Okay, how about this then.
> > > > 
> > > > We do some computations here and based on them, conditionally want to
> > > > update sample_delay_ns. Because there is no penalty now, in terms of
> > > > removing/adding timers/wq, etc, why shouldn't we simply update the
> > > > sample_delay_ns everytime without any checks? That would mean that the
> > > > change of sampling rate is effective immediately, what can be better than that?
> > > 
> > > Yes, we can do that.
> > > 
> > > There is a small concern about updating in parallel with dbs_work_handler()
> > > in which case we may overwrite the (hopefully already correct) sample_delay_ns
> > > value that it has just written, but then it will be corrected next time we
> > > take a sample, so it shouldn't be a big deal.
> > > 
> > > OK, I'll update the patch to do that.
> > 
> > Great.
> > 
> > > > Also, we should do the same from update-sampling-rate of conservative
> > > > governor as well.
> > > 
> > > Let's just not change the whole world in one patch, OK?
> > 
> > Yeah, I wasn't asking to update in the same patch, but just that we
> > should do that as well.
> > 
> > > > I did a bit of that this morning, and there weren't any serious issues
> > > > as far as I could see :)
> > > 
> > > The case I'm mostly concerned about is when update_sampling_rate() looks
> > > at a CPU with a policy completely unrelated to the dbs_data it was called
> > > for.  In that case the "shared" object may just go away from under it at
> > > any time while it is looking at that object in theory.
> > 
> > Right, a way (of course we should try to find something better) is to move
> > that update to a separate work item, just as I did it in my patch..
> 
> No, it isn't.  Trying to do it asynchronously will only lead to more
> concurrency-related issues.
> 
> > But, I am quite sure we can get that fixed.
> 
What we need to do is make it possible for update_sampling_rate()
> to walk all of the cpu_dbs_infos and look at what their policy_dbs
> fields point to safely.
> 
> After my cleanup patches it does that under dbs_data_mutex and that works,
> because this mutex is also held around *any* updates of struct cpu_dbs_info
> anywhere.
> 
> However, the cpu_dbs_infos themselves are actually static, so they can be
> accessed at any time.  It looks like, then, we may just need to add a lock to
> each of them to ensure that the policy_dbs thing won't go away suddenly and
> we may not need dbs_data_mutex in there any more.
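
Something like this, roughly (untested; the "lock" field below is
hypothetical and only added for illustration):

	/* in struct cpu_dbs_info: */
	spinlock_t lock;		/* protects ->policy_dbs */

	/* in update_sampling_rate(), instead of holding dbs_data_mutex: */
	spin_lock(&cdbs->lock);
	policy_dbs = cdbs->policy_dbs;
	if (policy_dbs && policy_dbs->dbs_data == dbs_data)
		gov_update_sample_delay(policy_dbs, new_rate);
	spin_unlock(&cdbs->lock);

	/* with the same lock taken wherever ->policy_dbs is set or cleared */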

Moreover, update_sampling_rate() doesn't need to walk the cpu_dbs_infos,
it may walk policies instead.  Like after the (untested) appended patch.

Then, if we have a governor_data_lock in struct policy, we can use that
to protect policy_dbs while it is being accessed there and we're done.

I'll try to prototype something along these lines tomorrow.

Thanks,
Rafael


---
 drivers/cpufreq/cpufreq_ondemand.c |   21 ++++++---------------
 1 file changed, 6 insertions(+), 15 deletions(-)

Index: linux-pm/drivers/cpufreq/cpufreq_ondemand.c
===================================================================
--- linux-pm.orig/drivers/cpufreq/cpufreq_ondemand.c
+++ linux-pm/drivers/cpufreq/cpufreq_ondemand.c
@@ -254,34 +254,23 @@ static void update_sampling_rate(struct
 	cpumask_copy(&cpumask, cpu_online_mask);
 
 	for_each_cpu(cpu, &cpumask) {
-		struct cpufreq_policy *policy;
-		struct od_cpu_dbs_info_s *dbs_info;
-		struct cpu_dbs_info *cdbs;
+		struct cpufreq_policy *policy = cpufreq_cpu_get(cpu);
 		struct policy_dbs_info *policy_dbs;
 
-		dbs_info = &per_cpu(od_cpu_dbs_info, cpu);
-		cdbs = &dbs_info->cdbs;
-		policy_dbs = cdbs->policy_dbs;
-
-		/*
-		 * A valid policy_dbs and policy_dbs->policy means governor
-		 * hasn't stopped or exited yet.
-		 */
-		if (!policy_dbs || !policy_dbs->policy)
+		if (!policy)
 			continue;
 
-		policy = policy_dbs->policy;
-
 		/* clear all CPUs of this policy */
 		cpumask_andnot(&cpumask, &cpumask, policy->cpus);
 
+		policy_dbs = policy->governor_data;
 		/*
 		 * Update sampling rate for CPUs whose policy is governed by
 		 * dbs_data. In case of governor_per_policy, only a single
 		 * policy will be governed by dbs_data, otherwise there can be
 		 * multiple policies that are governed by the same dbs_data.
 		 */
-		if (dbs_data == policy_dbs->dbs_data) {
+		if (policy_dbs && policy_dbs->dbs_data == dbs_data) {
 			mutex_lock(&policy_dbs->timer_mutex);
 			/*
 			 * On 32-bit architectures this may race with the
@@ -304,6 +293,8 @@ static void update_sampling_rate(struct
 			gov_update_sample_delay(policy_dbs, new_rate);
 			mutex_unlock(&policy_dbs->timer_mutex);
 		}
+
+		cpufreq_cpu_put(policy);
 	}
 
 	mutex_unlock(&dbs_data_mutex);

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH 3/3 v3] cpufreq: governor: Replace timers with utilization update callbacks
  2016-02-08  2:08                   ` Rafael J. Wysocki
@ 2016-02-08 11:52                     ` Viresh Kumar
  2016-02-08 12:52                       ` Rafael J. Wysocki
  0 siblings, 1 reply; 134+ messages in thread
From: Viresh Kumar @ 2016-02-08 11:52 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Rafael J. Wysocki, Linux PM list, Linux Kernel Mailing List,
	Peter Zijlstra, Srinivas Pandruvada, Juri Lelli, Steve Muckle,
	Thomas Gleixner

On 08-02-16, 03:08, Rafael J. Wysocki wrote:
> Moreover, update_sampling_rate() doesn't need to walk the cpu_dbs_infos,
> it may walk policies instead.  Like after the (untested) appended patch.
> 
> Then, if we have a governor_data_lock in struct policy, we can use that
> to protect policy_dbs while it is being accessed there and we're done.
> 
> I'll try to prototype something along these lines tomorrow.

I have solved that in a different way, and dropped the lock from
update_sampling_rate(). Please see if that looks good.

-- 
viresh

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH 3/3 v3] cpufreq: governor: Replace timers with utilization update callbacks
  2016-02-08 11:52                     ` Viresh Kumar
@ 2016-02-08 12:52                       ` Rafael J. Wysocki
  2016-02-08 13:40                         ` Rafael J. Wysocki
  0 siblings, 1 reply; 134+ messages in thread
From: Rafael J. Wysocki @ 2016-02-08 12:52 UTC (permalink / raw)
  To: Viresh Kumar
  Cc: Rafael J. Wysocki, Rafael J. Wysocki, Linux PM list,
	Linux Kernel Mailing List, Peter Zijlstra, Srinivas Pandruvada,
	Juri Lelli, Steve Muckle, Thomas Gleixner

On Mon, Feb 8, 2016 at 12:52 PM, Viresh Kumar <viresh.kumar@linaro.org> wrote:
> On 08-02-16, 03:08, Rafael J. Wysocki wrote:
>> Moreover, update_sampling_rate() doesn't need to walk the cpu_dbs_infos,
>> it may walk policies instead.  Like after the (untested) appended patch.
>>
>> Then, if we have a governor_data_lock in struct policy, we can use that
>> to protect policy_dbs while it is being accessed there and we're done.
>>
>> I'll try to prototype something along these lines tomorrow.
>
> I have solved that in a different way, and dropped the lock from
> update_sampling_rate(). Please see if that looks good.

Well, almost.

I like the list approach, but you need to be careful about it.  Let me
comment more on the patches in the series.

I have a gut feeling that my idea of walking policies will end up
being simpler in the end, but let's see. :-)

Thanks,
Rafael

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH 3/3 v3] cpufreq: governor: Replace timers with utilization update callbacks
  2016-02-08 12:52                       ` Rafael J. Wysocki
@ 2016-02-08 13:40                         ` Rafael J. Wysocki
  0 siblings, 0 replies; 134+ messages in thread
From: Rafael J. Wysocki @ 2016-02-08 13:40 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Viresh Kumar, Rafael J. Wysocki, Linux PM list,
	Linux Kernel Mailing List, Peter Zijlstra, Srinivas Pandruvada,
	Juri Lelli, Steve Muckle, Thomas Gleixner

On Mon, Feb 8, 2016 at 1:52 PM, Rafael J. Wysocki <rafael@kernel.org> wrote:
> On Mon, Feb 8, 2016 at 12:52 PM, Viresh Kumar <viresh.kumar@linaro.org> wrote:
>> On 08-02-16, 03:08, Rafael J. Wysocki wrote:
>>> Moreover, update_sampling_rate() doesn't need to walk the cpu_dbs_infos,
>>> it may walk policies instead.  Like after the (untested) appended patch.
>>>
>>> Then, if we have a governor_data_lock in struct policy, we can use that
>>> to protect policy_dbs while it is being accessed there and we're done.
>>>
>>> I'll try to prototype something along these lines tomorrow.
>>
>> I have solved that in a different way, and dropped the lock from
>> update_sampling_rate(). Please see if that looks good.
>
> Well, almost.
>
> I like the list approach, but you need to be careful about it.  Let me
> comment more on the patches in the series.
>
> I have a gut feeling that my idea of walking policies will end up
> being simpler in the end, but let's see. :-)

Well, my gut feeling seems to have been incorrect, as often happens.

Thanks,
Rafael

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH 0/3] cpufreq: Replace timers with utilization update callbacks
  2016-02-03 22:20 ` [PATCH 0/3] cpufreq: " Rafael J. Wysocki
  2016-02-04  0:08   ` Srinivas Pandruvada
  2016-02-04 10:51   ` Juri Lelli
@ 2016-02-08 23:06   ` Rafael J. Wysocki
  2016-02-09  0:39     ` Steve Muckle
  2 siblings, 1 reply; 134+ messages in thread
From: Rafael J. Wysocki @ 2016-02-08 23:06 UTC (permalink / raw)
  To: Linux PM list
  Cc: Linux Kernel Mailing List, Peter Zijlstra, Srinivas Pandruvada,
	Viresh Kumar, Juri Lelli, Steve Muckle, Thomas Gleixner

On Wednesday, February 03, 2016 11:20:19 PM Rafael J. Wysocki wrote:
> On Friday, January 29, 2016 11:52:15 PM Rafael J. Wysocki wrote:
> > Hi,
> > 
> > The following patch series introduces a mechanism allowing the cpufreq core
> > and "setpolicy" drivers to provide utilization update callbacks to be invoked
> > by the scheduler on utilization changes.  Those callbacks can be used to run
> > the sampling and frequency adjustments code (intel_pstate) or to schedule the
> > execution of that code in process context (cpufreq core) instead of per-CPU
> > deferrable timers used in cpufreq today (which Thomas complained about during
> > the last Kernel Summit).
> > 
> > [1/3] Introduce a mechanism for calling into cpufreq from the scheduler and
> >       registering callbacks to be executed from there.
> > 
> > [2/3] Modify intel_pstate to use the mechanism introduced by [1/3] instead
> >       of per-CPU deferrable timers to do its work.
> > 
> > This isn't entirely straightforward as the scheduler context running those
> > callbacks is really special.  Among other things it can only use raw
> > spinlocks and cannot invoke wake_up_process() directly.  Also, calling
> > ktime_get() from there may be too expensive on some systems.  All that has to
> > be taken into account, but even then the change allows some lines of code to be
> > cut from the driver.
> > 
> > Some performance and energy consumption measurements have been carried out with
> > an earlier version of this patch and it looks like the changes lead to a
> > slightly better performing system that consumes slightly less energy at the
> > same time overall.
> > 
> > [3/3] Modify the cpufreq core to use the mechanism introduced by [1/3] instead
> >       of per-CPU deferrable timers to queue up the execution of governor work.
> > 
> > Again, this isn't really straightforward for the above reasons, but still the
> > code size is reduced a bit by the changes.
> > 
> > I'm still unsure about the energy consumption and performance impact of [3/3]
> > as earlier versions of it led to inconsistent results (most likely due to bugs
> > in them that hopefully have been fixed in this version).  In particular, the
> > additional irq_work may turn out to be problematic, but more optimizations are
> > possible on top of this one even if it makes things worse by itself.
> > 
> > For example, it should be possible to move the execution of state selection
> > code into the utilization update callback itself, at least in principle, for
> > all governors.  The P-state/OPP adjustment may need to be run from process
> > context still, but for the drivers that can do it without sleeping it should
> > be possible to move that into the utilization update callback as well.
> > 
> > The patches are on top of 4.5-rc1 and have been tested on a couple of x86
> > machines.
> 
> Well, no responses here, so I'm inclined to believe that this series is fine
> by everybody (at least by everybody in the CC).
> 
> I can wait for a few days more, but new material is starting to pile up on top
> of these patches and I'll simply need to move forward at one point.

Now that all review comments have been addressed in patch [3/3], I'm going to
put this series into linux-next.

There already are 20+ patches on top of it in the queue, including fixes for
bugs that have haunted us for quite some time (and that functionally depend on
this set) and I'd really like all that to get enough linux-next coverage, so
there really isn't more time to wait.

Thanks,
Rafael

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH 0/3] cpufreq: Replace timers with utilization update callbacks
  2016-02-08 23:06   ` Rafael J. Wysocki
@ 2016-02-09  0:39     ` Steve Muckle
  2016-02-09  1:01       ` Rafael J. Wysocki
  0 siblings, 1 reply; 134+ messages in thread
From: Steve Muckle @ 2016-02-09  0:39 UTC (permalink / raw)
  To: Rafael J. Wysocki, Linux PM list
  Cc: Linux Kernel Mailing List, Peter Zijlstra, Srinivas Pandruvada,
	Viresh Kumar, Juri Lelli, Thomas Gleixner

Hi Rafael,

On 02/08/2016 03:06 PM, Rafael J. Wysocki wrote:
> Now that all review comments have been addressed in patch [3/3], I'm going to
> put this series into linux-next.
> 
> There already are 20+ patches on top of it in the queue, including fixes for
> bugs that have haunted us for quite some time (and that functionally depend on
> this set) and I'd really like all that to get enough linux-next coverage, so
> there really isn't more time to wait.

Sorry for the late reply. As Juri mentioned I was OOO last week and
really just got to look at this today.

One concern I had was, given that the lone scheduler update hook is in
CFS, is it possible for governor updates to be stalled due to RT or DL
task activity?

thanks,
Steve

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH 0/3] cpufreq: Replace timers with utilization update callbacks
  2016-02-09  0:39     ` Steve Muckle
@ 2016-02-09  1:01       ` Rafael J. Wysocki
  2016-02-09 20:05         ` Rafael J. Wysocki
  0 siblings, 1 reply; 134+ messages in thread
From: Rafael J. Wysocki @ 2016-02-09  1:01 UTC (permalink / raw)
  To: Steve Muckle, Peter Zijlstra
  Cc: Rafael J. Wysocki, Linux PM list, Linux Kernel Mailing List,
	Srinivas Pandruvada, Viresh Kumar, Juri Lelli, Thomas Gleixner

On Tue, Feb 9, 2016 at 1:39 AM, Steve Muckle <steve.muckle@linaro.org> wrote:
> Hi Rafael,
>
> On 02/08/2016 03:06 PM, Rafael J. Wysocki wrote:
>> Now that all review comments have been addressed in patch [3/3], I'm going to
>> put this series into linux-next.
>>
>> There already are 20+ patches on top of it in the queue, including fixes for
>> bugs that have haunted us for quite some time (and that functionally depend on
>> this set) and I'd really like all that to get enough linux-next coverage, so
>> there really isn't more time to wait.
>
> Sorry for the late reply. As Juri mentioned I was OOO last week and
> really just got to look at this today.
>
> One concern I had was, given that the lone scheduler update hook is in
> CFS, is it possible for governor updates to be stalled due to RT or DL
> task activity?

I don't think they may be completely stalled, but I'd prefer Peter to
answer that as he suggested to do it this way.

Peter?

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH 3/3 v5] cpufreq: governor: Replace timers with utilization update callbacks
  2016-02-07 14:50         ` [PATCH 3/3 v5] " Rafael J. Wysocki
  2016-02-07 15:36           ` Viresh Kumar
@ 2016-02-09 10:01           ` Gautham R Shenoy
  2016-02-09 18:49             ` Rafael J. Wysocki
  1 sibling, 1 reply; 134+ messages in thread
From: Gautham R Shenoy @ 2016-02-09 10:01 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Linux PM list, Linux Kernel Mailing List, Peter Zijlstra,
	Srinivas Pandruvada, Viresh Kumar, Juri Lelli, Steve Muckle,
	Thomas Gleixner

Hello Rafael,

On Sun, Feb 07, 2016 at 03:50:31PM +0100, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> 
> Instead of using a per-CPU deferrable timer for queuing up governor
> work items, register a utilization update callback that will be
> invoked from the scheduler on utilization changes.
> 
> The sampling rate is still the same as what was used for the
> deferrable timers and the added irq_work overhead should be offset by
> the eliminated timers overhead, so in theory the functional impact of
> this patch should not be significant.

I tested this patch series (including v5 of PATCH 3) on POWER with
Viresh's CPUFreq test suite. I didn't see any issues with the
patchset except for a lockdep splat involving "s_active" and
"od_dbs_cdata.mutex", which was also observed on 4.5-rc3 and which
was fixed by Viresh's recent patches. 

With a kernbench run, there were no regressions when compared to 4.5-rc3.

FWIW, Tested-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>

> 
> Thanks,
> Rafael

--
Thanks and Regards
gautham.

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH 3/3 v5] cpufreq: governor: Replace timers with utilization update callbacks
  2016-02-09 10:01           ` Gautham R Shenoy
@ 2016-02-09 18:49             ` Rafael J. Wysocki
  0 siblings, 0 replies; 134+ messages in thread
From: Rafael J. Wysocki @ 2016-02-09 18:49 UTC (permalink / raw)
  To: ego
  Cc: Rafael J. Wysocki, Linux PM list, Linux Kernel Mailing List,
	Peter Zijlstra, Srinivas Pandruvada, Viresh Kumar, Juri Lelli,
	Steve Muckle, Thomas Gleixner

On Tue, Feb 9, 2016 at 11:01 AM, Gautham R Shenoy
<ego@linux.vnet.ibm.com> wrote:
> Hello Rafael,
>
> On Sun, Feb 07, 2016 at 03:50:31PM +0100, Rafael J. Wysocki wrote:
>> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>>
>> Instead of using a per-CPU deferrable timer for queuing up governor
>> work items, register a utilization update callback that will be
>> invoked from the scheduler on utilization changes.
>>
>> The sampling rate is still the same as what was used for the
>> deferrable timers and the added irq_work overhead should be offset by
>> the eliminated timers overhead, so in theory the functional impact of
>> this patch should not be significant.
>
> I tested this patch series (including v5 of PATCH 3) on POWER with
> Viresh's CPUFreq test suite. I didn't see any issues with the
> patchset except for a lockdep splat involving "s_active" and
> "od_dbs_cdata.mutex", which was also observed on 4.5-rc3 and which
> was fixed by Viresh's recent patches.
>
> With a kernbench run, there were no regressions when compared to 4.5-rc3.
>
> FWIW, Tested-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>

Thank you!

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH 0/3] cpufreq: Replace timers with utilization update callbacks
  2016-02-09  1:01       ` Rafael J. Wysocki
@ 2016-02-09 20:05         ` Rafael J. Wysocki
  2016-02-10  1:02           ` Steve Muckle
                             ` (2 more replies)
  0 siblings, 3 replies; 134+ messages in thread
From: Rafael J. Wysocki @ 2016-02-09 20:05 UTC (permalink / raw)
  To: Steve Muckle
  Cc: Rafael J. Wysocki, Peter Zijlstra, Linux PM list,
	Linux Kernel Mailing List, Srinivas Pandruvada, Viresh Kumar,
	Juri Lelli, Thomas Gleixner

On Tuesday, February 09, 2016 02:01:39 AM Rafael J. Wysocki wrote:
> On Tue, Feb 9, 2016 at 1:39 AM, Steve Muckle <steve.muckle@linaro.org> wrote:
> > Hi Rafael,
> >
> > On 02/08/2016 03:06 PM, Rafael J. Wysocki wrote:
> >> Now that all review comments have been addressed in patch [3/3], I'm going to
> >> put this series into linux-next.
> >>
> >> There already are 20+ patches on top of it in the queue, including fixes for
> >> bugs that have haunted us for quite some time (and that functionally depend on
> >> this set) and I'd really like all that to get enough linux-next coverage, so
> >> there really isn't more time to wait.
> >
> > Sorry for the late reply. As Juri mentioned I was OOO last week and
> > really just got to look at this today.
> >
> > One concern I had was, given that the lone scheduler update hook is in
> > CFS, is it possible for governor updates to be stalled due to RT or DL
> > task activity?
> 
> I don't think they may be completely stalled, but I'd prefer Peter to
> answer that as he suggested to do it this way.

In any case, if that concern turns out to be significant in practice, it may
be addressed like in the appended modification of patch [1/3] from the $subject
series.

With that things look like before from the cpufreq side, but the other sched
classes also get a chance to trigger a cpufreq update.  The drawback is the
cpu_clock() call instead of passing the time value from update_load_avg(), but
I guess we can live with that if necessary.

FWIW, this modification doesn't seem to break things on my test machine.

Thanks,
Rafael


Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
---
 drivers/cpufreq/cpufreq.c |   44 ++++++++++++++++++++++++++++++++++++++++++++
 include/linux/cpufreq.h   |    7 +++++++
 include/linux/sched.h     |    7 +++++++
 kernel/sched/deadline.c   |    3 +++
 kernel/sched/fair.c       |   29 ++++++++++++++++++++++++++++-
 kernel/sched/rt.c         |    3 +++
 6 files changed, 92 insertions(+), 1 deletion(-)

Index: linux-pm/include/linux/sched.h
===================================================================
--- linux-pm.orig/include/linux/sched.h
+++ linux-pm/include/linux/sched.h
@@ -3207,4 +3207,11 @@ static inline unsigned long rlimit_max(u
 	return task_rlimit_max(current, limit);
 }
 
+void cpufreq_update_util(unsigned long util, unsigned long max);
+
+static inline void cpufreq_kick(void)
+{
+	cpufreq_update_util(ULONG_MAX, ULONG_MAX);
+}
+
 #endif
Index: linux-pm/kernel/sched/fair.c
===================================================================
--- linux-pm.orig/kernel/sched/fair.c
+++ linux-pm/kernel/sched/fair.c
@@ -2819,12 +2819,17 @@ static inline int update_cfs_rq_load_avg
 	return decayed || removed;
 }
 
+__weak void cpufreq_update_util(unsigned long util, unsigned long max)
+{
+}
+
 /* Update task and its cfs_rq load average */
 static inline void update_load_avg(struct sched_entity *se, int update_tg)
 {
 	struct cfs_rq *cfs_rq = cfs_rq_of(se);
 	u64 now = cfs_rq_clock_task(cfs_rq);
-	int cpu = cpu_of(rq_of(cfs_rq));
+	struct rq *rq = rq_of(cfs_rq);
+	int cpu = cpu_of(rq);
 
 	/*
 	 * Track task load average for carrying it to new CPU after migrated, and
@@ -2836,6 +2841,28 @@ static inline void update_load_avg(struc
 
 	if (update_cfs_rq_load_avg(now, cfs_rq) && update_tg)
 		update_tg_load_avg(cfs_rq, 0);
+
+	if (cpu == smp_processor_id() && &rq->cfs == cfs_rq) {
+		unsigned long max = rq->cpu_capacity_orig;
+
+		/*
+		 * There are a few boundary cases this might miss but it should
+		 * get called often enough that that should (hopefully) not be
+		 * a real problem -- added to that it only calls on the local
+	 * CPU, so if we enqueue remotely we'll lose an update, but
+		 * the next tick/schedule should update.
+		 *
+		 * It will not get called when we go idle, because the idle
+		 * thread is a different class (!fair), nor will the utilization
+		 * number include things like RT tasks.
+		 *
+		 * As is, the util number is not freq invariant (we'd have to
+		 * implement arch_scale_freq_capacity() for that).
+		 *
+		 * See cpu_util().
+		 */
+		cpufreq_update_util(min(cfs_rq->avg.util_avg, max), max);
+	}
 }
 
 static void attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
Index: linux-pm/drivers/cpufreq/cpufreq.c
===================================================================
--- linux-pm.orig/drivers/cpufreq/cpufreq.c
+++ linux-pm/drivers/cpufreq/cpufreq.c
@@ -102,6 +102,50 @@ static LIST_HEAD(cpufreq_governor_list);
 static struct cpufreq_driver *cpufreq_driver;
 static DEFINE_PER_CPU(struct cpufreq_policy *, cpufreq_cpu_data);
 static DEFINE_RWLOCK(cpufreq_driver_lock);
+
+static DEFINE_PER_CPU(struct update_util_data *, cpufreq_update_util_data);
+
+/**
+ * cpufreq_set_update_util_data - Populate the CPU's update_util_data pointer.
+ * @cpu: The CPU to set the pointer for.
+ * @data: New pointer value.
+ *
+ * Set and publish the update_util_data pointer for the given CPU.  That pointer
+ * points to a struct update_util_data object containing a callback function
+ * to call from cpufreq_update_util().  That function will be called from an RCU
+ * read-side critical section, so it must not sleep.
+ *
+ * Callers must use RCU callbacks to free any memory that might be accessed
+ * via the old update_util_data pointer or invoke synchronize_rcu() right after
+ * this function to avoid use-after-free.
+ */
+void cpufreq_set_update_util_data(int cpu, struct update_util_data *data)
+{
+	rcu_assign_pointer(per_cpu(cpufreq_update_util_data, cpu), data);
+}
+EXPORT_SYMBOL_GPL(cpufreq_set_update_util_data);
+
+/**
+ * cpufreq_update_util - Take a note about CPU utilization changes.
+ * @util: Current utilization.
+ * @max: Utilization ceiling.
+ *
+ * This function is called by the scheduler on every invocation of
+ * update_load_avg() on the CPU whose utilization is being updated.
+ */
+void cpufreq_update_util(unsigned long util, unsigned long max)
+{
+	struct update_util_data *data;
+
+	rcu_read_lock();
+
+	data = rcu_dereference(*this_cpu_ptr(&cpufreq_update_util_data));
+	if (data && data->func)
+		data->func(data, cpu_clock(smp_processor_id()), util, max);
+
+	rcu_read_unlock();
+}
+
 DEFINE_MUTEX(cpufreq_governor_lock);
 
 /* Flag to suspend/resume CPUFreq governors */
Index: linux-pm/include/linux/cpufreq.h
===================================================================
--- linux-pm.orig/include/linux/cpufreq.h
+++ linux-pm/include/linux/cpufreq.h
@@ -322,6 +322,13 @@ int cpufreq_unregister_driver(struct cpu
 const char *cpufreq_get_current_driver(void);
 void *cpufreq_get_driver_data(void);
 
+struct update_util_data {
+	void (*func)(struct update_util_data *data,
+		     u64 time, unsigned long util, unsigned long max);
+};
+
+void cpufreq_set_update_util_data(int cpu, struct update_util_data *data);
+
 static inline void cpufreq_verify_within_limits(struct cpufreq_policy *policy,
 		unsigned int min, unsigned int max)
 {
Index: linux-pm/kernel/sched/rt.c
===================================================================
--- linux-pm.orig/kernel/sched/rt.c
+++ linux-pm/kernel/sched/rt.c
@@ -2212,6 +2212,9 @@ static void task_tick_rt(struct rq *rq,
 
 	update_curr_rt(rq);
 
+	/* Kick cpufreq to prevent it from stalling. */
+	cpufreq_kick();
+
 	watchdog(rq, p);
 
 	/*
Index: linux-pm/kernel/sched/deadline.c
===================================================================
--- linux-pm.orig/kernel/sched/deadline.c
+++ linux-pm/kernel/sched/deadline.c
@@ -1197,6 +1197,9 @@ static void task_tick_dl(struct rq *rq,
 {
 	update_curr_dl(rq);
 
+	/* Kick cpufreq to prevent it from stalling. */
+	cpufreq_kick();
+
 	/*
 	 * Even when we have runtime, update_curr_dl() might have resulted in us
 	 * not being the leftmost task anymore. In that case NEED_RESCHED will
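
For completeness, this is roughly how a consumer of the above hooks in
(a minimal sketch with hypothetical names; per the kerneldoc above, the
callback runs under RCU from scheduler context, so it must not sleep):

	static void my_update_util(struct update_util_data *data, u64 time,
				   unsigned long util, unsigned long max)
	{
		/* sample here or kick an irq_work, but never sleep */
	}

	static struct update_util_data my_data = { .func = my_update_util };

	/* registration, e.g. at governor/driver start: */
	for_each_cpu(cpu, policy->cpus)
		cpufreq_set_update_util_data(cpu, &my_data);

	/* teardown: clear the pointers, then wait for in-flight callbacks */
	for_each_cpu(cpu, policy->cpus)
		cpufreq_set_update_util_data(cpu, NULL);
	synchronize_rcu();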

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH 0/3] cpufreq: Replace timers with utilization update callbacks
  2016-02-09 20:05         ` Rafael J. Wysocki
@ 2016-02-10  1:02           ` Steve Muckle
  2016-02-10  1:57             ` Rafael J. Wysocki
  2016-02-11 11:59             ` Peter Zijlstra
  2016-02-10 12:33           ` Juri Lelli
  2016-02-11 11:51           ` Peter Zijlstra
  2 siblings, 2 replies; 134+ messages in thread
From: Steve Muckle @ 2016-02-10  1:02 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Rafael J. Wysocki, Peter Zijlstra, Linux PM list,
	Linux Kernel Mailing List, Srinivas Pandruvada, Viresh Kumar,
	Juri Lelli, Thomas Gleixner

On 02/09/2016 12:05 PM, Rafael J. Wysocki wrote:
>>> One concern I had was, given that the lone scheduler update hook is in
>>> CFS, is it possible for governor updates to be stalled due to RT or DL
>>> task activity?
>>
>> I don't think they may be completely stalled, but I'd prefer Peter to
>> answer that as he suggested to do it this way.
> 
> In any case, if that concern turns out to be significant in practice, it may
> be addressed like in the appended modification of patch [1/3] from the $subject
> series.
> 
> With that things look like before from the cpufreq side, but the other sched
> classes also get a chance to trigger a cpufreq update.  The drawback is the
> cpu_clock() call instead of passing the time value from update_load_avg(), but
> I guess we can live with that if necessary.
> 
> FWIW, this modification doesn't seem to break things on my test machine.
> 
...
> Index: linux-pm/kernel/sched/rt.c
> ===================================================================
> --- linux-pm.orig/kernel/sched/rt.c
> +++ linux-pm/kernel/sched/rt.c
> @@ -2212,6 +2212,9 @@ static void task_tick_rt(struct rq *rq,
>  
>  	update_curr_rt(rq);
>  
> +	/* Kick cpufreq to prevent it from stalling. */
> +	cpufreq_kick();
> +
>  	watchdog(rq, p);
>  
>  	/*
> Index: linux-pm/kernel/sched/deadline.c
> ===================================================================
> --- linux-pm.orig/kernel/sched/deadline.c
> +++ linux-pm/kernel/sched/deadline.c
> @@ -1197,6 +1197,9 @@ static void task_tick_dl(struct rq *rq,
>  {
>  	update_curr_dl(rq);
>  
> +	/* Kick cpufreq to prevent it from stalling. */
> +	cpufreq_kick();
> +
>  	/*
>  	 * Even when we have runtime, update_curr_dl() might have resulted in us
>  	 * not being the leftmost task anymore. In that case NEED_RESCHED will

I think additional hooks such as enqueue/dequeue would be needed in
RT/DL. The task tick callbacks will only run if a task in that class is
executing at the time of the tick. There could be intermittent RT/DL
task activity in a frequency domain (the only task activity there, no
CFS tasks) that doesn't happen to overlap the tick. Worst case the task
activity could be periodic in such a way that it never overlaps the tick
and the update is never made.
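
For example, something along these lines (sketch only, exact placement
to be decided):

	static void
	enqueue_task_rt(struct rq *rq, struct task_struct *p, int flags)
	{
		...
		/* Kick cpufreq on enqueue too, so an RT-only CPU gets
		 * updates even if its activity never overlaps the tick. */
		cpufreq_kick();
	}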

thanks,
steve

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH 0/3] cpufreq: Replace timers with utilization update callbacks
  2016-02-10  1:02           ` Steve Muckle
@ 2016-02-10  1:57             ` Rafael J. Wysocki
  2016-02-10  3:09               ` Rafael J. Wysocki
  2016-02-11 11:59             ` Peter Zijlstra
  1 sibling, 1 reply; 134+ messages in thread
From: Rafael J. Wysocki @ 2016-02-10  1:57 UTC (permalink / raw)
  To: Steve Muckle
  Cc: Rafael J. Wysocki, Rafael J. Wysocki, Peter Zijlstra,
	Linux PM list, Linux Kernel Mailing List, Srinivas Pandruvada,
	Viresh Kumar, Juri Lelli, Thomas Gleixner

On Wed, Feb 10, 2016 at 2:02 AM, Steve Muckle <steve.muckle@linaro.org> wrote:
> On 02/09/2016 12:05 PM, Rafael J. Wysocki wrote:
>>>> One concern I had was, given that the lone scheduler update hook is in
>>>> CFS, is it possible for governor updates to be stalled due to RT or DL
>>>> task activity?
>>>
>>> I don't think they may be completely stalled, but I'd prefer Peter to
>>> answer that as he suggested to do it this way.
>>
>> In any case, if that concern turns out to be significant in practice, it may
>> be addressed like in the appended modification of patch [1/3] from the $subject
>> series.
>>
>> With that things look like before from the cpufreq side, but the other sched
>> classes also get a chance to trigger a cpufreq update.  The drawback is the
>> cpu_clock() call instead of passing the time value from update_load_avg(), but
>> I guess we can live with that if necessary.
>>
>> FWIW, this modification doesn't seem to break things on my test machine.
>>
> ...
>> Index: linux-pm/kernel/sched/rt.c
>> ===================================================================
>> --- linux-pm.orig/kernel/sched/rt.c
>> +++ linux-pm/kernel/sched/rt.c
>> @@ -2212,6 +2212,9 @@ static void task_tick_rt(struct rq *rq,
>>
>>       update_curr_rt(rq);
>>
>> +     /* Kick cpufreq to prevent it from stalling. */
>> +     cpufreq_kick();
>> +
>>       watchdog(rq, p);
>>
>>       /*
>> Index: linux-pm/kernel/sched/deadline.c
>> ===================================================================
>> --- linux-pm.orig/kernel/sched/deadline.c
>> +++ linux-pm/kernel/sched/deadline.c
>> @@ -1197,6 +1197,9 @@ static void task_tick_dl(struct rq *rq,
>>  {
>>       update_curr_dl(rq);
>>
>> +     /* Kick cpufreq to prevent it from stalling. */
>> +     cpufreq_kick();
>> +
>>       /*
>>        * Even when we have runtime, update_curr_dl() might have resulted in us
>>        * not being the leftmost task anymore. In that case NEED_RESCHED will
>
> I think additional hooks such as enqueue/dequeue would be needed in
> RT/DL. The task tick callbacks will only run if a task in that class is
> executing at the time of the tick. There could be intermittent RT/DL
> task activity in a frequency domain (the only task activity there, no
> CFS tasks) that doesn't happen to overlap the tick. Worst case the task
> activity could be periodic in such a way that it never overlaps the tick
> and the update is never made.

So if I'm reading this correctly, it would be better to put the hooks
into update_curr_rt/dl()?

Thanks,
Rafael

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH 0/3] cpufreq: Replace timers with utilization update callbacks
  2016-02-10  1:57             ` Rafael J. Wysocki
@ 2016-02-10  3:09               ` Rafael J. Wysocki
  2016-02-10 19:47                 ` Steve Muckle
  0 siblings, 1 reply; 134+ messages in thread
From: Rafael J. Wysocki @ 2016-02-10  3:09 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Steve Muckle, Rafael J. Wysocki, Peter Zijlstra, Linux PM list,
	Linux Kernel Mailing List, Srinivas Pandruvada, Viresh Kumar,
	Juri Lelli, Thomas Gleixner

On Wed, Feb 10, 2016 at 2:57 AM, Rafael J. Wysocki <rafael@kernel.org> wrote:
> On Wed, Feb 10, 2016 at 2:02 AM, Steve Muckle <steve.muckle@linaro.org> wrote:
>> On 02/09/2016 12:05 PM, Rafael J. Wysocki wrote:
>>>>> One concern I had was, given that the lone scheduler update hook is in
>>>>> CFS, is it possible for governor updates to be stalled due to RT or DL
>>>>> task activity?
>>>>
>>>> I don't think they may be completely stalled, but I'd prefer Peter to
>>>> answer that as he suggested to do it this way.
>>>
>>> In any case, if that concern turns out to be significant in practice, it may
>>> be addressed like in the appended modification of patch [1/3] from the $subject
>>> series.
>>>
>>> With that things look like before from the cpufreq side, but the other sched
>>> classes also get a chance to trigger a cpufreq update.  The drawback is the
>>> cpu_clock() call instead of passing the time value from update_load_avg(), but
>>> I guess we can live with that if necessary.
>>>
>>> FWIW, this modification doesn't seem to break things on my test machine.
>>>
>> ...
>>> Index: linux-pm/kernel/sched/rt.c
>>> ===================================================================
>>> --- linux-pm.orig/kernel/sched/rt.c
>>> +++ linux-pm/kernel/sched/rt.c
>>> @@ -2212,6 +2212,9 @@ static void task_tick_rt(struct rq *rq,
>>>
>>>       update_curr_rt(rq);
>>>
>>> +     /* Kick cpufreq to prevent it from stalling. */
>>> +     cpufreq_kick();
>>> +
>>>       watchdog(rq, p);
>>>
>>>       /*
>>> Index: linux-pm/kernel/sched/deadline.c
>>> ===================================================================
>>> --- linux-pm.orig/kernel/sched/deadline.c
>>> +++ linux-pm/kernel/sched/deadline.c
>>> @@ -1197,6 +1197,9 @@ static void task_tick_dl(struct rq *rq,
>>>  {
>>>       update_curr_dl(rq);
>>>
>>> +     /* Kick cpufreq to prevent it from stalling. */
>>> +     cpufreq_kick();
>>> +
>>>       /*
>>>        * Even when we have runtime, update_curr_dl() might have resulted in us
>>>        * not being the leftmost task anymore. In that case NEED_RESCHED will
>>
>> I think additional hooks such as enqueue/dequeue would be needed in
>> RT/DL. The task tick callbacks will only run if a task in that class is
>> executing at the time of the tick. There could be intermittent RT/DL
>> task activity in a frequency domain (the only task activity there, no
>> CFS tasks) that doesn't happen to overlap the tick. Worst case the task
>> activity could be periodic in such a way that it never overlaps the tick
>> and the update is never made.
>
> So if I'm reading this correctly, it would be better to put the hooks
> into update_curr_rt/dl()?

If done this way, I guess we may pass rq_clock_task(rq) as the time
arg to cpufreq_update_util() from there and then the cpu_clock() call
I've added to this prototype won't be necessary any more.
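
E.g. something like this (sketch only, assuming cpufreq_update_util() is
changed to take the time stamp as its first argument):

	static void update_curr_rt(struct rq *rq)
	{
		...
		/* hypothetical signature with an explicit time argument */
		cpufreq_update_util(rq_clock_task(rq), ULONG_MAX, ULONG_MAX);
	}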

Thanks,
Rafael

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH 0/3] cpufreq: Replace timers with utilization update callbacks
  2016-02-09 20:05         ` Rafael J. Wysocki
  2016-02-10  1:02           ` Steve Muckle
@ 2016-02-10 12:33           ` Juri Lelli
  2016-02-10 13:23             ` Rafael J. Wysocki
  2016-02-11 11:51           ` Peter Zijlstra
  2 siblings, 1 reply; 134+ messages in thread
From: Juri Lelli @ 2016-02-10 12:33 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Steve Muckle, Rafael J. Wysocki, Peter Zijlstra, Linux PM list,
	Linux Kernel Mailing List, Srinivas Pandruvada, Viresh Kumar,
	Thomas Gleixner

Hi Rafael,

On 09/02/16 21:05, Rafael J. Wysocki wrote:

[...]

> +/**
> + * cpufreq_update_util - Take a note about CPU utilization changes.
> + * @util: Current utilization.
> + * @max: Utilization ceiling.
> + *
> + * This function is called by the scheduler on every invocation of
> + * update_load_avg() on the CPU whose utilization is being updated.
> + */
> +void cpufreq_update_util(unsigned long util, unsigned long max)
> +{
> +	struct update_util_data *data;
> +
> +	rcu_read_lock();
> +
> +	data = rcu_dereference(*this_cpu_ptr(&cpufreq_update_util_data));
> +	if (data && data->func)
> +		data->func(data, cpu_clock(smp_processor_id()), util, max);

Are util and max used anywhere? It seems to me that cpu_clock is used by
the callbacks to check whether the sampling period has elapsed, but I couldn't
yet find who is using util and max.

Thanks,

- Juri

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH 0/3] cpufreq: Replace timers with utilization update callbacks
  2016-02-10 12:33           ` Juri Lelli
@ 2016-02-10 13:23             ` Rafael J. Wysocki
  2016-02-10 14:03               ` Juri Lelli
  0 siblings, 1 reply; 134+ messages in thread
From: Rafael J. Wysocki @ 2016-02-10 13:23 UTC (permalink / raw)
  To: Juri Lelli
  Cc: Rafael J. Wysocki, Steve Muckle, Rafael J. Wysocki,
	Peter Zijlstra, Linux PM list, Linux Kernel Mailing List,
	Srinivas Pandruvada, Viresh Kumar, Thomas Gleixner

On Wed, Feb 10, 2016 at 1:33 PM, Juri Lelli <juri.lelli@arm.com> wrote:
> Hi Rafael,
>
> On 09/02/16 21:05, Rafael J. Wysocki wrote:
>
> [...]
>
>> +/**
>> + * cpufreq_update_util - Take a note about CPU utilization changes.
>> + * @util: Current utilization.
>> + * @max: Utilization ceiling.
>> + *
>> + * This function is called by the scheduler on every invocation of
>> + * update_load_avg() on the CPU whose utilization is being updated.
>> + */
>> +void cpufreq_update_util(unsigned long util, unsigned long max)
>> +{
>> +     struct update_util_data *data;
>> +
>> +     rcu_read_lock();
>> +
>> +     data = rcu_dereference(*this_cpu_ptr(&cpufreq_update_util_data));
>> +     if (data && data->func)
>> +             data->func(data, cpu_clock(smp_processor_id()), util, max);
>
> Are util and max used anywhere?

They aren't yet, but they will be.

Maybe not in this cycle (if it takes too much time to integrate the
preliminary changes), but we definitely are going to use those
numbers.

Thanks,
Rafael

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH 0/3] cpufreq: Replace timers with utilization update callbacks
  2016-02-10 13:23             ` Rafael J. Wysocki
@ 2016-02-10 14:03               ` Juri Lelli
  2016-02-10 14:26                 ` Rafael J. Wysocki
  0 siblings, 1 reply; 134+ messages in thread
From: Juri Lelli @ 2016-02-10 14:03 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Rafael J. Wysocki, Steve Muckle, Peter Zijlstra, Linux PM list,
	Linux Kernel Mailing List, Srinivas Pandruvada, Viresh Kumar,
	Thomas Gleixner

On 10/02/16 14:23, Rafael J. Wysocki wrote:
> On Wed, Feb 10, 2016 at 1:33 PM, Juri Lelli <juri.lelli@arm.com> wrote:
> > Hi Rafael,
> >
> > On 09/02/16 21:05, Rafael J. Wysocki wrote:
> >
> > [...]
> >
> >> +/**
> >> + * cpufreq_update_util - Take a note about CPU utilization changes.
> >> + * @util: Current utilization.
> >> + * @max: Utilization ceiling.
> >> + *
> >> + * This function is called by the scheduler on every invocation of
> >> + * update_load_avg() on the CPU whose utilization is being updated.
> >> + */
> >> +void cpufreq_update_util(unsigned long util, unsigned long max)
> >> +{
> >> +     struct update_util_data *data;
> >> +
> >> +     rcu_read_lock();
> >> +
> >> +     data = rcu_dereference(*this_cpu_ptr(&cpufreq_update_util_data));
> >> +     if (data && data->func)
> >> +             data->func(data, cpu_clock(smp_processor_id()), util, max);
> >
> > Are util and max used anywhere?
> 
> They aren't yet, but they will be.
> 
> Maybe not in this cycle (if it takes too much time to integrate the
> preliminary changes), but we definitely are going to use those
> numbers.
> 

Oh OK. However, I was under the impression that this set was only
proposing a way to get rid of timers and use the scheduler as a heartbeat
for cpufreq governors. The governors' sample-based approach wouldn't
change, though. Am I wrong in assuming this?

Also, is linux-pm/bleeding-edge the one I want to fetch to try this set
out?

Thanks,

- Juri

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH 0/3] cpufreq: Replace timers with utilization update callbacks
  2016-02-10 14:03               ` Juri Lelli
@ 2016-02-10 14:26                 ` Rafael J. Wysocki
  2016-02-10 14:46                   ` Juri Lelli
  0 siblings, 1 reply; 134+ messages in thread
From: Rafael J. Wysocki @ 2016-02-10 14:26 UTC (permalink / raw)
  To: Juri Lelli
  Cc: Rafael J. Wysocki, Rafael J. Wysocki, Steve Muckle,
	Peter Zijlstra, Linux PM list, Linux Kernel Mailing List,
	Srinivas Pandruvada, Viresh Kumar, Thomas Gleixner

On Wed, Feb 10, 2016 at 3:03 PM, Juri Lelli <juri.lelli@arm.com> wrote:
> On 10/02/16 14:23, Rafael J. Wysocki wrote:
>> On Wed, Feb 10, 2016 at 1:33 PM, Juri Lelli <juri.lelli@arm.com> wrote:
>> > Hi Rafael,
>> >
>> > On 09/02/16 21:05, Rafael J. Wysocki wrote:
>> >
>> > [...]
>> >
>> >> +/**
>> >> + * cpufreq_update_util - Take a note about CPU utilization changes.
>> >> + * @util: Current utilization.
>> >> + * @max: Utilization ceiling.
>> >> + *
>> >> + * This function is called by the scheduler on every invocation of
>> >> + * update_load_avg() on the CPU whose utilization is being updated.
>> >> + */
>> >> +void cpufreq_update_util(unsigned long util, unsigned long max)
>> >> +{
>> >> +     struct update_util_data *data;
>> >> +
>> >> +     rcu_read_lock();
>> >> +
>> >> +     data = rcu_dereference(*this_cpu_ptr(&cpufreq_update_util_data));
>> >> +     if (data && data->func)
>> >> +             data->func(data, cpu_clock(smp_processor_id()), util, max);
>> >
>> > Are util and max used anywhere?
>>
>> They aren't yet, but they will be.
>>
>> Maybe not in this cycle (if it takes too much time to integrate the
>> preliminary changes), but we definitely are going to use those
>> numbers.
>>
>
> Oh OK. However, I was under the impression that this set was only
> proposing a way to get rid of timers and use the scheduler as a heartbeat
> for cpufreq governors. The governors' sample-based approach wouldn't
> change, though. Am I wrong in assuming this?

Your assumption is correct.

The sample-based approach doesn't change at this time, simply to avoid
making too many changes in one go.

The next step, as I'm seeing it, would be to use the
scheduler-provided utilization in the governor computations instead of
the load estimation made by governors themselves.
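
For example (just a sketch of the direction, not an actual patch), the
governor could compute its load input directly from the values passed in
by the scheduler:

	/* util and max as delivered to the update_util callback */
	load = 100 * util / max;

instead of measuring idle time deltas over each sampling window.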

> Also, is linux-pm/bleeding-edge the one I want to fetch to try this set out?

You can get it from there, but possibly with some changes unrelated to cpufreq.

You can also pull from the pm-cpufreq-test branch to get the cpufreq
changes only.

Apart from that, I'm going to resend the $subject set with updated patch
[1/3] for completeness.

Thanks,
Rafael

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH 0/3] cpufreq: Replace timers with utilization update callbacks
  2016-02-10 14:26                 ` Rafael J. Wysocki
@ 2016-02-10 14:46                   ` Juri Lelli
  2016-02-10 15:46                     ` Rafael J. Wysocki
  0 siblings, 1 reply; 134+ messages in thread
From: Juri Lelli @ 2016-02-10 14:46 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Rafael J. Wysocki, Steve Muckle, Peter Zijlstra, Linux PM list,
	Linux Kernel Mailing List, Srinivas Pandruvada, Viresh Kumar,
	Thomas Gleixner

On 10/02/16 15:26, Rafael J. Wysocki wrote:
> On Wed, Feb 10, 2016 at 3:03 PM, Juri Lelli <juri.lelli@arm.com> wrote:
> > On 10/02/16 14:23, Rafael J. Wysocki wrote:
> >> On Wed, Feb 10, 2016 at 1:33 PM, Juri Lelli <juri.lelli@arm.com> wrote:
> >> > Hi Rafael,
> >> >
> >> > On 09/02/16 21:05, Rafael J. Wysocki wrote:
> >> >
> >> > [...]
> >> >
> >> >> +/**
> >> >> + * cpufreq_update_util - Take a note about CPU utilization changes.
> >> >> + * @util: Current utilization.
> >> >> + * @max: Utilization ceiling.
> >> >> + *
> >> >> + * This function is called by the scheduler on every invocation of
> >> >> + * update_load_avg() on the CPU whose utilization is being updated.
> >> >> + */
> >> >> +void cpufreq_update_util(unsigned long util, unsigned long max)
> >> >> +{
> >> >> +     struct update_util_data *data;
> >> >> +
> >> >> +     rcu_read_lock();
> >> >> +
> >> >> +     data = rcu_dereference(*this_cpu_ptr(&cpufreq_update_util_data));
> >> >> +     if (data && data->func)
> >> >> +             data->func(data, cpu_clock(smp_processor_id()), util, max);
> >> >
> >> > Are util and max used anywhere?
> >>
> >> They aren't yet, but they will be.
> >>
> >> Maybe not in this cycle (if it takes too much time to integrate the
> >> preliminary changes), but we definitely are going to use those
> >> numbers.
> >>
> >
> > Oh OK. However, I was under the impression that this set was only
> > proposing a way to get rid of timers and use the scheduler as a heartbeat
> > for cpufreq governors. The governors' sample-based approach wouldn't
> > change, though. Am I wrong in assuming this?
> 
> Your assumption is correct.
> 

In this case, wouldn't it be possible to simply put the kicks in
sched/core.c? scheduler_tick() seems a good candidate for that, and you
could complement it with enqueue/dequeue/etc., if needed.

I'm actually wondering if a slow CONFIG_HZ might affect the governors'
sampling rate. We might have the scheduler tick firing every 40ms with
the sampling rate set to 10 or 20ms, mightn't we?

> The sample-based approach doesn't change at this time, simply to avoid
> making too many changes in one go.
> 
> The next step, as I'm seeing it, would be to use the
> scheduler-provided utilization in the governor computations instead of
> the load estimation made by governors themselves.
> 

OK. But I'm not sure what this buys us. If the end goal is still to
do sampling, aren't we better off using the (1 - idle) estimation as we
do today?
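
(For reference, that estimation boils down to roughly

	load = 100 * (wall_time - idle_time) / wall_time;

computed from the CPU idle statistics over each sampling window.)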

> > Also, is linux-pm/bleeding-edge the one I want to fetch to try this set out?
> 
> You can get it from there, but possibly with some changes unrelated to cpufreq.
> 
> You can also pull from the pm-cpufreq-test branch to get the cpufreq
> changes only.
> 
> Apart from that, I'm going to resend the $subject set with updated patch
> [1/3] for completeness.
> 

Great, thanks! Let's see if I can finally find time to run some tests
this time :).

Best,

- Juri

^ permalink raw reply	[flat|nested] 134+ messages in thread

* [PATCH v6 0/3] cpufreq: Replace timers with utilization update callbacks
  2016-01-29 22:52 [PATCH 0/3] cpufreq: Replace timers with utilization update callbacks Rafael J. Wysocki
                   ` (3 preceding siblings ...)
  2016-02-03 22:20 ` [PATCH 0/3] cpufreq: " Rafael J. Wysocki
@ 2016-02-10 15:17 ` Rafael J. Wysocki
  2016-02-10 15:21   ` [PATCH v6 1/3] cpufreq: Add mechanism for registering " Rafael J. Wysocki
                     ` (3 more replies)
  4 siblings, 4 replies; 134+ messages in thread
From: Rafael J. Wysocki @ 2016-02-10 15:17 UTC (permalink / raw)
  To: Linux PM list, Ingo Molnar
  Cc: Linux Kernel Mailing List, Peter Zijlstra, Srinivas Pandruvada,
	Viresh Kumar, Juri Lelli, Steve Muckle, Thomas Gleixner

Hi,

I thought it would be useful to send an update of this (adding Ingo, as Peter
has not been responsive lately).  The version goes straight to 6 as patch [3/3]
has already gone through 5 revisions.

The intro below still applies, so let me quote it.

On Friday, January 29, 2016 11:52:15 PM Rafael J. Wysocki wrote:
> Hi,
> 
> The following patch series introduces a mechanism allowing the cpufreq core
> and "setpolicy" drivers to provide utilization update callbacks to be invoked
> by the scheduler on utilization changes.  Those callbacks can be used to run
> the sampling and frequency adjustments code (intel_pstate) or to schedule the
> execution of that code in process context (cpufreq core) instead of per-CPU
> deferrable timers used in cpufreq today (which Thomas complained about during
> the last Kernel Summit).
> 
> [1/3] Introduce a mechanism for calling into cpufreq from the scheduler and
>       registering callbacks to be executed from there.
> 
> [2/3] Modify intel_pstate to use the mechanism introduced by [1/3] instead
>       of per-CPU deferrable timers to do its work.
> 
> This isn't entirely straightforward as the scheduler context running those
> callbacks is really special.  Among other things it can only use raw
> spinlocks and cannot invoke wake_up_process() directly.  Also, calling
> ktime_get() from there may be too expensive on some systems.  All that has to
> be taken into account, but even then the change allows some lines of code to be
> cut from the driver.
> 
> Some performance and energy consumption measurements have been carried out with
> an earlier version of this patch and it looks like the changes lead to a
> slightly better performing system that consumes slightly less energy at the
> same time overall.
> 
> [3/3] Modify the cpufreq core to use the mechanism introduced by [1/3] instead
>       of per-CPU deferrable timers to queue up the execution of governor work.
> 
> Again, this isn't really straightforward for the above reasons, but still the
> code size is reduced a bit by the changes.
> 

As it turns out, patch [3/3] appears to lead to improvements in both overall
system performance and energy consumption at the same time (they are small, but
measurable).  It also unlocks further simplifications and fixes in the cpufreq
core code, so we want it badly. :-)

The most significant change from the previous version of the set is that [1/3]
now also triggers cpufreq updates from the RT and DL sched classes, to avoid
stalling them in situations where no CFS activity is taking place on the CPU
due to RT/DL task activity (as pointed out by Steve).
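
The "kick" here is just cpufreq_trigger_update(), the trivial wrapper that
patch [1/3] adds around cpufreq_update_util().  Its (ULONG_MAX, 0) argument
pair carries no real utilization information and, presumably, simply marks
the call as a bare trigger rather than a CFS utilization update:

        static inline void cpufreq_trigger_update(u64 time)
        {
                /*
                 * util == ULONG_MAX with max == 0: no meaningful numbers
                 * here, just make sure the registered callback gets to run.
                 */
                cpufreq_update_util(time, ULONG_MAX, 0);
        }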

As stated in a reply to Juri, the scheduler-provided utilization numbers are
not used by cpufreq at this time, but we will be using them going forward.
 
The patches are on top of 4.5-rc3 and have been tested on x86 machines.

There already is a metric ton of stuff to go on top of them, so I'd like to
make progress here if at all possible.

I'll put this set (along with all the stuff depending on it) into the
pm-cpufreq-test branch of the linux-pm tree.

Thanks,
Rafael

^ permalink raw reply	[flat|nested] 134+ messages in thread

* [PATCH v6 1/3] cpufreq: Add mechanism for registering utilization update callbacks
  2016-02-10 15:17 ` [PATCH v6 " Rafael J. Wysocki
@ 2016-02-10 15:21   ` Rafael J. Wysocki
  2016-02-10 23:01     ` [PATCH v7 " Rafael J. Wysocki
  2016-02-10 15:25   ` [PATCH v6 2/3] cpufreq: intel_pstate: Replace timers with " Rafael J. Wysocki
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 134+ messages in thread
From: Rafael J. Wysocki @ 2016-02-10 15:21 UTC (permalink / raw)
  To: Linux PM list, Ingo Molnar
  Cc: Linux Kernel Mailing List, Peter Zijlstra, Srinivas Pandruvada,
	Viresh Kumar, Juri Lelli, Steve Muckle, Thomas Gleixner

From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

Introduce a mechanism by which parts of the cpufreq subsystem
("setpolicy" drivers or the core) can register callbacks to be
executed from cpufreq_update_util() which is invoked by the
scheduler's update_load_avg() on CPU utilization changes.

This allows the "setpolicy" drivers to dispense with their timers
and do all of the computations they need, as well as frequency/voltage
adjustments, in the update_load_avg() code path, among other things.

The update_load_avg() changes were suggested by Peter Zijlstra.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
---

Hi Ingo,

This has been based on Peter's advice, but he's not been well for the last
several days, so can you plase have a look at this and let me know whether
or not it is acceptable and how it can be improved possibly?

The ACK from Viresh applies to the cpufreq core changes that are the same
as in the previous version(s) of this patch.

Thanks,
Rafael

---
 drivers/cpufreq/cpufreq.c |   45 +++++++++++++++++++++++++++++++++++++++++++++
 include/linux/cpufreq.h   |   17 +++++++++++++++++
 kernel/sched/deadline.c   |    3 +++
 kernel/sched/fair.c       |   26 +++++++++++++++++++++++++-
 kernel/sched/rt.c         |    3 +++
 kernel/sched/sched.h      |    1 +
 6 files changed, 94 insertions(+), 1 deletion(-)

Index: linux-pm/include/linux/cpufreq.h
===================================================================
--- linux-pm.orig/include/linux/cpufreq.h
+++ linux-pm/include/linux/cpufreq.h
@@ -151,6 +151,19 @@ static inline bool policy_is_shared(stru
 extern struct kobject *cpufreq_global_kobject;
 
 #ifdef CONFIG_CPU_FREQ
+void cpufreq_update_util(u64 time, unsigned long util, unsigned long max);
+static inline void cpufreq_trigger_update(u64 time)
+{
+	cpufreq_update_util(time, ULONG_MAX, 0);
+}
+
+struct update_util_data {
+	void (*func)(struct update_util_data *data,
+		     u64 time, unsigned long util, unsigned long max);
+};
+
+void cpufreq_set_update_util_data(int cpu, struct update_util_data *data);
+
 unsigned int cpufreq_get(unsigned int cpu);
 unsigned int cpufreq_quick_get(unsigned int cpu);
 unsigned int cpufreq_quick_get_max(unsigned int cpu);
@@ -162,6 +175,10 @@ int cpufreq_update_policy(unsigned int c
 bool have_governor_per_policy(void);
 struct kobject *get_governor_parent_kobj(struct cpufreq_policy *policy);
 #else
+static inline void cpufreq_update_util(u64 time, unsigned long util,
+				       unsigned long max) {}
+static inline void cpufreq_trigger_update(u64 time) {}
+
 static inline unsigned int cpufreq_get(unsigned int cpu)
 {
 	return 0;
Index: linux-pm/kernel/sched/sched.h
===================================================================
--- linux-pm.orig/kernel/sched/sched.h
+++ linux-pm/kernel/sched/sched.h
@@ -9,6 +9,7 @@
 #include <linux/irq_work.h>
 #include <linux/tick.h>
 #include <linux/slab.h>
+#include <linux/cpufreq.h>
 
 #include "cpupri.h"
 #include "cpudeadline.h"
Index: linux-pm/kernel/sched/fair.c
===================================================================
--- linux-pm.orig/kernel/sched/fair.c
+++ linux-pm/kernel/sched/fair.c
@@ -2824,7 +2824,8 @@ static inline void update_load_avg(struc
 {
 	struct cfs_rq *cfs_rq = cfs_rq_of(se);
 	u64 now = cfs_rq_clock_task(cfs_rq);
-	int cpu = cpu_of(rq_of(cfs_rq));
+	struct rq *rq = rq_of(cfs_rq);
+	int cpu = cpu_of(rq);
 
 	/*
 	 * Track task load average for carrying it to new CPU after migrated, and
@@ -2836,6 +2837,29 @@ static inline void update_load_avg(struc
 
 	if (update_cfs_rq_load_avg(now, cfs_rq) && update_tg)
 		update_tg_load_avg(cfs_rq, 0);
+
+	if (cpu == smp_processor_id() && &rq->cfs == cfs_rq) {
+		unsigned long max = rq->cpu_capacity_orig;
+
+		/*
+		 * There are a few boundary cases this might miss but it should
+		 * get called often enough that that should (hopefully) not be
+		 * a real problem -- added to that it only calls on the local
+		 * CPU, so if we enqueue remotely we'll miss an update, but
+		 * the next tick/schedule should update.
+		 *
+		 * It will not get called when we go idle, because the idle
+		 * thread is a different class (!fair), nor will the utilization
+		 * number include things like RT tasks.
+		 *
+		 * As is, the util number is not freq-invariant (we'd have to
+		 * implement arch_scale_freq_capacity() for that).
+		 *
+		 * See cpu_util().
+		 */
+		cpufreq_update_util(rq_clock_task(rq),
+				    min(cfs_rq->avg.util_avg, max), max);
+	}
 }
 
 static void attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
Index: linux-pm/kernel/sched/deadline.c
===================================================================
--- linux-pm.orig/kernel/sched/deadline.c
+++ linux-pm/kernel/sched/deadline.c
@@ -726,6 +726,9 @@ static void update_curr_dl(struct rq *rq
 	if (!dl_task(curr) || !on_dl_rq(dl_se))
 		return;
 
+	/* Kick a cpufreq update to prevent it from stalling. */
+	cpufreq_trigger_update(rq_clock_task(rq));
+
 	/*
 	 * Consumed budget is computed considering the time as
 	 * observed by schedulable tasks (excluding time spent
Index: linux-pm/kernel/sched/rt.c
===================================================================
--- linux-pm.orig/kernel/sched/rt.c
+++ linux-pm/kernel/sched/rt.c
@@ -949,6 +949,9 @@ static void update_curr_rt(struct rq *rq
 	if (unlikely((s64)delta_exec <= 0))
 		return;
 
+	/* Kick a cpufreq update to prevent it from stalling. */
+	cpufreq_trigger_update(rq_clock_task(rq));
+
 	schedstat_set(curr->se.statistics.exec_max,
 		      max(curr->se.statistics.exec_max, delta_exec));
 
Index: linux-pm/drivers/cpufreq/cpufreq.c
===================================================================
--- linux-pm.orig/drivers/cpufreq/cpufreq.c
+++ linux-pm/drivers/cpufreq/cpufreq.c
@@ -102,6 +102,51 @@ static LIST_HEAD(cpufreq_governor_list);
 static struct cpufreq_driver *cpufreq_driver;
 static DEFINE_PER_CPU(struct cpufreq_policy *, cpufreq_cpu_data);
 static DEFINE_RWLOCK(cpufreq_driver_lock);
+
+static DEFINE_PER_CPU(struct update_util_data *, cpufreq_update_util_data);
+
+/**
+ * cpufreq_set_update_util_data - Populate the CPU's update_util_data pointer.
+ * @cpu: The CPU to set the pointer for.
+ * @data: New pointer value.
+ *
+ * Set and publish the update_util_data pointer for the given CPU.  That pointer
+ * points to a struct update_util_data object containing a callback function
+ * to call from cpufreq_update_util().  That function will be called from an RCU
+ * read-side critical section, so it must not sleep.
+ *
+ * Callers must use RCU callbacks to free any memory that might be accessed
+ * via the old update_util_data pointer or invoke synchronize_rcu() right after
+ * this function to avoid use-after-free.
+ */
+void cpufreq_set_update_util_data(int cpu, struct update_util_data *data)
+{
+	rcu_assign_pointer(per_cpu(cpufreq_update_util_data, cpu), data);
+}
+EXPORT_SYMBOL_GPL(cpufreq_set_update_util_data);
+
+/**
+ * cpufreq_update_util - Take a note about CPU utilization changes.
+ * @time: Current time.
+ * @util: Current utilization.
+ * @max: Utilization ceiling.
+ *
+ * This function is called by the scheduler on every invocation of
+ * update_load_avg() on the CPU whose utilization is being updated.
+ */
+void cpufreq_update_util(u64 time, unsigned long util, unsigned long max)
+{
+	struct update_util_data *data;
+
+	rcu_read_lock();
+
+	data = rcu_dereference(*this_cpu_ptr(&cpufreq_update_util_data));
+	if (data && data->func)
+		data->func(data, time, util, max);
+
+	rcu_read_unlock();
+}
+
 DEFINE_MUTEX(cpufreq_governor_lock);
 
 /* Flag to suspend/resume CPUFreq governors */

^ permalink raw reply	[flat|nested] 134+ messages in thread
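
For illustration, a minimal (hypothetical) consumer of the interface added
by this patch might look like the sketch below; it simply mirrors the
registration and teardown contract that the intel_pstate changes in patch
[2/3] follow.  The callback runs under rcu_read_lock() in scheduler
context, so it must not sleep, and teardown has to clear the pointer and
wait out a grace period before freeing anything the callback may touch:

        struct my_cpu_data {
                struct update_util_data update_util;    /* embedded for container_of() */
                /* driver-private state ... */
        };

        static void my_update_util(struct update_util_data *data, u64 time,
                                   unsigned long util, unsigned long max)
        {
                struct my_cpu_data *d = container_of(data, struct my_cpu_data,
                                                     update_util);

                /*
                 * Sample and/or adjust the P-state using d here.  Scheduler
                 * context: raw spinlocks only, no sleeping, no direct
                 * wake_up_process().
                 */
        }

        static int my_start(int cpu)
        {
                struct my_cpu_data *d = kzalloc(sizeof(*d), GFP_KERNEL);

                if (!d)
                        return -ENOMEM;

                d->update_util.func = my_update_util;
                cpufreq_set_update_util_data(cpu, &d->update_util);
                return 0;
        }

        static void my_stop(int cpu, struct my_cpu_data *d)
        {
                cpufreq_set_update_util_data(cpu, NULL);
                synchronize_rcu();      /* wait for in-flight callbacks */
                kfree(d);
        }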

* [PATCH v6 2/3] cpufreq: intel_pstate: Replace timers with utilization update callbacks
  2016-02-10 15:17 ` [PATCH v6 " Rafael J. Wysocki
  2016-02-10 15:21   ` [PATCH v6 1/3] cpufreq: Add mechanism for registering " Rafael J. Wysocki
@ 2016-02-10 15:25   ` Rafael J. Wysocki
  2016-02-10 15:36   ` [PATCH v6 3/3] cpufreq: governor: " Rafael J. Wysocki
  2016-02-10 23:11   ` [PATCH v6 0/3] cpufreq: " Doug Smythies
  3 siblings, 0 replies; 134+ messages in thread
From: Rafael J. Wysocki @ 2016-02-10 15:25 UTC (permalink / raw)
  To: Linux PM list
  Cc: Ingo Molnar, Linux Kernel Mailing List, Peter Zijlstra,
	Srinivas Pandruvada, Viresh Kumar, Juri Lelli, Steve Muckle,
	Thomas Gleixner

From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

Instead of using a per-CPU deferrable timer for utilization sampling
and P-states adjustments, register a utilization update callback that
will be invoked from the scheduler on utilization changes.

The sampling rate is still the same as what was used for the deferrable
timers, so the functional impact of this patch should not be significant.

Based on an earlier patch from Srinivas Pandruvada.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
---

No changes from the previous version.

---
 drivers/cpufreq/intel_pstate.c |  103 +++++++++++++++--------------------------
 1 file changed, 39 insertions(+), 64 deletions(-)

Index: linux-pm/drivers/cpufreq/intel_pstate.c
===================================================================
--- linux-pm.orig/drivers/cpufreq/intel_pstate.c
+++ linux-pm/drivers/cpufreq/intel_pstate.c
@@ -71,7 +71,7 @@ struct sample {
 	u64 mperf;
 	u64 tsc;
 	int freq;
-	ktime_t time;
+	u64 time;
 };
 
 struct pstate_data {
@@ -103,13 +103,13 @@ struct _pid {
 struct cpudata {
 	int cpu;
 
-	struct timer_list timer;
+	struct update_util_data update_util;
 
 	struct pstate_data pstate;
 	struct vid_data vid;
 	struct _pid pid;
 
-	ktime_t last_sample_time;
+	u64	last_sample_time;
 	u64	prev_aperf;
 	u64	prev_mperf;
 	u64	prev_tsc;
@@ -120,6 +120,7 @@ struct cpudata {
 static struct cpudata **all_cpu_data;
 struct pstate_adjust_policy {
 	int sample_rate_ms;
+	s64 sample_rate_ns;
 	int deadband;
 	int setpoint;
 	int p_gain_pct;
@@ -712,7 +713,7 @@ static void core_set_pstate(struct cpuda
 	if (limits->no_turbo && !limits->turbo_disabled)
 		val |= (u64)1 << 32;
 
-	wrmsrl_on_cpu(cpudata->cpu, MSR_IA32_PERF_CTL, val);
+	wrmsrl(MSR_IA32_PERF_CTL, val);
 }
 
 static int knl_get_turbo_pstate(void)
@@ -883,7 +884,7 @@ static inline void intel_pstate_calc_bus
 	sample->core_pct_busy = (int32_t)core_pct;
 }
 
-static inline void intel_pstate_sample(struct cpudata *cpu)
+static inline void intel_pstate_sample(struct cpudata *cpu, u64 time)
 {
 	u64 aperf, mperf;
 	unsigned long flags;
@@ -900,7 +901,7 @@ static inline void intel_pstate_sample(s
 	local_irq_restore(flags);
 
 	cpu->last_sample_time = cpu->sample.time;
-	cpu->sample.time = ktime_get();
+	cpu->sample.time = time;
 	cpu->sample.aperf = aperf;
 	cpu->sample.mperf = mperf;
 	cpu->sample.tsc =  tsc;
@@ -915,22 +916,6 @@ static inline void intel_pstate_sample(s
 	cpu->prev_tsc = tsc;
 }
 
-static inline void intel_hwp_set_sample_time(struct cpudata *cpu)
-{
-	int delay;
-
-	delay = msecs_to_jiffies(50);
-	mod_timer_pinned(&cpu->timer, jiffies + delay);
-}
-
-static inline void intel_pstate_set_sample_time(struct cpudata *cpu)
-{
-	int delay;
-
-	delay = msecs_to_jiffies(pid_params.sample_rate_ms);
-	mod_timer_pinned(&cpu->timer, jiffies + delay);
-}
-
 static inline int32_t get_target_pstate_use_cpu_load(struct cpudata *cpu)
 {
 	struct sample *sample = &cpu->sample;
@@ -970,8 +955,7 @@ static inline int32_t get_target_pstate_
 static inline int32_t get_target_pstate_use_performance(struct cpudata *cpu)
 {
 	int32_t core_busy, max_pstate, current_pstate, sample_ratio;
-	s64 duration_us;
-	u32 sample_time;
+	u64 duration_ns;
 
 	/*
 	 * core_busy is the ratio of actual performance to max
@@ -990,18 +974,16 @@ static inline int32_t get_target_pstate_
 	core_busy = mul_fp(core_busy, div_fp(max_pstate, current_pstate));
 
 	/*
-	 * Since we have a deferred timer, it will not fire unless
-	 * we are in C0.  So, determine if the actual elapsed time
-	 * is significantly greater (3x) than our sample interval.  If it
-	 * is, then we were idle for a long enough period of time
-	 * to adjust our busyness.
+	 * Since our utilization update callback will not run unless we are
+	 * in C0, check if the actual elapsed time is significantly greater (3x)
+	 * than our sample interval.  If it is, then we were idle for a long
+	 * enough period of time to adjust our busyness.
 	 */
-	sample_time = pid_params.sample_rate_ms  * USEC_PER_MSEC;
-	duration_us = ktime_us_delta(cpu->sample.time,
-				     cpu->last_sample_time);
-	if (duration_us > sample_time * 3) {
-		sample_ratio = div_fp(int_tofp(sample_time),
-				      int_tofp(duration_us));
+	duration_ns = cpu->sample.time - cpu->last_sample_time;
+	if ((s64)duration_ns > pid_params.sample_rate_ns * 3
+	    && cpu->last_sample_time > 0) {
+		sample_ratio = div_fp(int_tofp(pid_params.sample_rate_ns),
+				      int_tofp(duration_ns));
 		core_busy = mul_fp(core_busy, sample_ratio);
 	}
 
@@ -1031,23 +1013,17 @@ static inline void intel_pstate_adjust_b
 		sample->freq);
 }
 
-static void intel_hwp_timer_func(unsigned long __data)
-{
-	struct cpudata *cpu = (struct cpudata *) __data;
-
-	intel_pstate_sample(cpu);
-	intel_hwp_set_sample_time(cpu);
-}
-
-static void intel_pstate_timer_func(unsigned long __data)
+static void intel_pstate_update_util(struct update_util_data *data, u64 time,
+				     unsigned long util, unsigned long max)
 {
-	struct cpudata *cpu = (struct cpudata *) __data;
-
-	intel_pstate_sample(cpu);
+	struct cpudata *cpu = container_of(data, struct cpudata, update_util);
+	u64 delta_ns = time - cpu->sample.time;
 
-	intel_pstate_adjust_busy_pstate(cpu);
-
-	intel_pstate_set_sample_time(cpu);
+	if ((s64)delta_ns >= pid_params.sample_rate_ns) {
+		intel_pstate_sample(cpu, time);
+		if (!hwp_active)
+			intel_pstate_adjust_busy_pstate(cpu);
+	}
 }
 
 #define ICPU(model, policy) \
@@ -1095,24 +1071,19 @@ static int intel_pstate_init_cpu(unsigne
 
 	cpu->cpu = cpunum;
 
-	if (hwp_active)
+	if (hwp_active) {
 		intel_pstate_hwp_enable(cpu);
+		pid_params.sample_rate_ms = 50;
+		pid_params.sample_rate_ns = 50 * NSEC_PER_MSEC;
+	}
 
 	intel_pstate_get_cpu_pstates(cpu);
 
-	init_timer_deferrable(&cpu->timer);
-	cpu->timer.data = (unsigned long)cpu;
-	cpu->timer.expires = jiffies + HZ/100;
-
-	if (!hwp_active)
-		cpu->timer.function = intel_pstate_timer_func;
-	else
-		cpu->timer.function = intel_hwp_timer_func;
-
 	intel_pstate_busy_pid_reset(cpu);
-	intel_pstate_sample(cpu);
+	intel_pstate_sample(cpu, 0);
 
-	add_timer_on(&cpu->timer, cpunum);
+	cpu->update_util.func = intel_pstate_update_util;
+	cpufreq_set_update_util_data(cpunum, &cpu->update_util);
 
 	pr_debug("intel_pstate: controlling: cpu %d\n", cpunum);
 
@@ -1196,7 +1167,9 @@ static void intel_pstate_stop_cpu(struct
 
 	pr_debug("intel_pstate: CPU %d exiting\n", cpu_num);
 
-	del_timer_sync(&all_cpu_data[cpu_num]->timer);
+	cpufreq_set_update_util_data(cpu_num, NULL);
+	synchronize_rcu();
+
 	if (hwp_active)
 		return;
 
@@ -1260,6 +1233,7 @@ static int intel_pstate_msrs_not_valid(v
 static void copy_pid_params(struct pstate_adjust_policy *policy)
 {
 	pid_params.sample_rate_ms = policy->sample_rate_ms;
+	pid_params.sample_rate_ns = pid_params.sample_rate_ms * NSEC_PER_MSEC;
 	pid_params.p_gain_pct = policy->p_gain_pct;
 	pid_params.i_gain_pct = policy->i_gain_pct;
 	pid_params.d_gain_pct = policy->d_gain_pct;
@@ -1451,7 +1425,8 @@ out:
 	get_online_cpus();
 	for_each_online_cpu(cpu) {
 		if (all_cpu_data[cpu]) {
-			del_timer_sync(&all_cpu_data[cpu]->timer);
+			cpufreq_set_update_util_data(cpu, NULL);
+			synchronize_rcu();
 			kfree(all_cpu_data[cpu]);
 		}
 	}

^ permalink raw reply	[flat|nested] 134+ messages in thread
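
Two details of the patch above are easy to miss.  First, core_set_pstate()
can switch from wrmsrl_on_cpu() to plain wrmsrl() because the utilization
update callback is always invoked on the CPU being updated.  Second, the
callback rate-limits itself: update_load_avg() may invoke it very often on
a busy CPU, but intel_pstate_sample() runs at most once per sample_rate_ns
(50 ms in the HWP case per this patch).  An annotated version of that gate,
assuming a hypothetical 10 ms sample rate:

        /*
         * Assume pid_params.sample_rate_ns == 10 * NSEC_PER_MSEC and that
         * the previous sample was taken at cpu->sample.time.
         */
        u64 delta_ns = time - cpu->sample.time;

        /*
         * Callbacks arriving 2, 5 or 8 ms after the last sample fall
         * through and do nothing; the first one at 10 ms or later takes a
         * new sample and, outside of HWP, re-runs the PID computation.
         */
        if ((s64)delta_ns >= pid_params.sample_rate_ns) {
                intel_pstate_sample(cpu, time);
                if (!hwp_active)
                        intel_pstate_adjust_busy_pstate(cpu);
        }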

* [PATCH v6 3/3] cpufreq: governor: Replace timers with utilization update callbacks
  2016-02-10 15:17 ` [PATCH v6 " Rafael J. Wysocki
  2016-02-10 15:21   ` [PATCH v6 1/3] cpufreq: Add mechanism for registering " Rafael J. Wysocki
  2016-02-10 15:25   ` [PATCH v6 2/3] cpufreq: intel_pstate: Replace timers with " Rafael J. Wysocki
@ 2016-02-10 15:36   ` Rafael J. Wysocki
  2016-02-10 23:11   ` [PATCH v6 0/3] cpufreq: " Doug Smythies
  3 siblings, 0 replies; 134+ messages in thread
From: Rafael J. Wysocki @ 2016-02-10 15:36 UTC (permalink / raw)
  To: Linux PM list
  Cc: Ingo Molnar, Linux Kernel Mailing List, Peter Zijlstra,
	Srinivas Pandruvada, Viresh Kumar, Juri Lelli, Steve Muckle,
	Thomas Gleixner

From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

Instead of using a per-CPU deferrable timer for queuing up governor
work items, register a utilization update callback that will be
invoked from the scheduler on utilization changes.

The sampling rate is still the same as what was used for the
deferrable timers and the added irq_work overhead should be offset by
the eliminated timers overhead, so in theory the functional impact of
this patch should not be significant.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Tested-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
---

Changes from the v5:
- Use #ifdef/#else in gov_queue_irq_work() to avoid build failures.
- Select IRQ_WORK in cpufreq Kconfig to avoid build failures.

No functional changes.

---
 drivers/cpufreq/Kconfig                |    1 
 drivers/cpufreq/cpufreq_conservative.c |    6 -
 drivers/cpufreq/cpufreq_governor.c     |  165 +++++++++++++++------------------
 drivers/cpufreq/cpufreq_governor.h     |   19 ++-
 drivers/cpufreq/cpufreq_ondemand.c     |   43 ++++----
 5 files changed, 114 insertions(+), 120 deletions(-)

Index: linux-pm/drivers/cpufreq/cpufreq_governor.h
===================================================================
--- linux-pm.orig/drivers/cpufreq/cpufreq_governor.h
+++ linux-pm/drivers/cpufreq/cpufreq_governor.h
@@ -18,6 +18,7 @@
 #define _CPUFREQ_GOVERNOR_H
 
 #include <linux/atomic.h>
+#include <linux/irq_work.h>
 #include <linux/cpufreq.h>
 #include <linux/kernel_stat.h>
 #include <linux/module.h>
@@ -138,11 +139,19 @@ struct cpu_common_dbs_info {
 	 */
 	struct mutex timer_mutex;
 
-	ktime_t time_stamp;
+	u64 last_sample_time;
+	s64 sample_delay_ns;
 	atomic_t skip_work;
+	struct irq_work irq_work;
 	struct work_struct work;
 };
 
+static inline void gov_update_sample_delay(struct cpu_common_dbs_info *shared,
+					   unsigned int delay_us)
+{
+	shared->sample_delay_ns = delay_us * NSEC_PER_USEC;
+}
+
 /* Per cpu structures */
 struct cpu_dbs_info {
 	u64 prev_cpu_idle;
@@ -155,7 +164,7 @@ struct cpu_dbs_info {
 	 * wake-up from idle.
 	 */
 	unsigned int prev_load;
-	struct timer_list timer;
+	struct update_util_data update_util;
 	struct cpu_common_dbs_info *shared;
 };
 
@@ -212,8 +221,7 @@ struct common_dbs_data {
 
 	struct cpu_dbs_info *(*get_cpu_cdbs)(int cpu);
 	void *(*get_cpu_dbs_info_s)(int cpu);
-	unsigned int (*gov_dbs_timer)(struct cpufreq_policy *policy,
-				      bool modify_all);
+	unsigned int (*gov_dbs_timer)(struct cpufreq_policy *policy);
 	void (*gov_check_cpu)(int cpu, unsigned int load);
 	int (*init)(struct dbs_data *dbs_data, bool notify);
 	void (*exit)(struct dbs_data *dbs_data, bool notify);
@@ -270,9 +278,6 @@ static ssize_t show_sampling_rate_min_go
 }
 
 extern struct mutex cpufreq_governor_lock;
-
-void gov_add_timers(struct cpufreq_policy *policy, unsigned int delay);
-void gov_cancel_work(struct cpu_common_dbs_info *shared);
 void dbs_check_cpu(struct dbs_data *dbs_data, int cpu);
 int cpufreq_governor_dbs(struct cpufreq_policy *policy,
 		struct common_dbs_data *cdata, unsigned int event);
Index: linux-pm/drivers/cpufreq/cpufreq_governor.c
===================================================================
--- linux-pm.orig/drivers/cpufreq/cpufreq_governor.c
+++ linux-pm/drivers/cpufreq/cpufreq_governor.c
@@ -128,10 +128,10 @@ void dbs_check_cpu(struct dbs_data *dbs_
 		 * dropped down. So we perform the copy only once, upon the
 		 * first wake-up from idle.)
 		 *
-		 * Detecting this situation is easy: the governor's deferrable
-		 * timer would not have fired during CPU-idle periods. Hence
-		 * an unusually large 'wall_time' (as compared to the sampling
-		 * rate) indicates this scenario.
+		 * Detecting this situation is easy: the governor's utilization
+		 * update handler would not have run during CPU-idle periods.
+		 * Hence, an unusually large 'wall_time' (as compared to the
+		 * sampling rate) indicates this scenario.
 		 *
 		 * prev_load can be zero in two cases and we must recalculate it
 		 * for both cases:
@@ -161,72 +161,48 @@ void dbs_check_cpu(struct dbs_data *dbs_
 }
 EXPORT_SYMBOL_GPL(dbs_check_cpu);
 
-void gov_add_timers(struct cpufreq_policy *policy, unsigned int delay)
+void gov_set_update_util(struct cpu_common_dbs_info *shared,
+			 unsigned int delay_us)
 {
+	struct cpufreq_policy *policy = shared->policy;
 	struct dbs_data *dbs_data = policy->governor_data;
-	struct cpu_dbs_info *cdbs;
 	int cpu;
 
+	gov_update_sample_delay(shared, delay_us);
+	shared->last_sample_time = 0;
+
 	for_each_cpu(cpu, policy->cpus) {
-		cdbs = dbs_data->cdata->get_cpu_cdbs(cpu);
-		cdbs->timer.expires = jiffies + delay;
-		add_timer_on(&cdbs->timer, cpu);
+		struct cpu_dbs_info *cdbs = dbs_data->cdata->get_cpu_cdbs(cpu);
+
+		cpufreq_set_update_util_data(cpu, &cdbs->update_util);
 	}
 }
-EXPORT_SYMBOL_GPL(gov_add_timers);
+EXPORT_SYMBOL_GPL(gov_set_update_util);
 
-static inline void gov_cancel_timers(struct cpufreq_policy *policy)
+static inline void gov_clear_update_util(struct cpufreq_policy *policy)
 {
-	struct dbs_data *dbs_data = policy->governor_data;
-	struct cpu_dbs_info *cdbs;
 	int i;
 
-	for_each_cpu(i, policy->cpus) {
-		cdbs = dbs_data->cdata->get_cpu_cdbs(i);
-		del_timer_sync(&cdbs->timer);
-	}
+	for_each_cpu(i, policy->cpus)
+		cpufreq_set_update_util_data(i, NULL);
+
+	synchronize_rcu();
 }
 
-void gov_cancel_work(struct cpu_common_dbs_info *shared)
+static void gov_cancel_work(struct cpu_common_dbs_info *shared)
 {
-	/* Tell dbs_timer_handler() to skip queuing up work items. */
+	/* Tell dbs_update_util_handler() to skip queuing up work items. */
 	atomic_inc(&shared->skip_work);
 	/*
-	 * If dbs_timer_handler() is already running, it may not notice the
-	 * incremented skip_work, so wait for it to complete to prevent its work
-	 * item from being queued up after the cancel_work_sync() below.
-	 */
-	gov_cancel_timers(shared->policy);
-	/*
-	 * In case dbs_timer_handler() managed to run and spawn a work item
-	 * before the timers have been canceled, wait for that work item to
-	 * complete and then cancel all of the timers set up by it.  If
-	 * dbs_timer_handler() runs again at that point, it will see the
-	 * positive value of skip_work and won't spawn any more work items.
+	 * If dbs_update_util_handler() is already running, it may not notice
+	 * the incremented skip_work, so wait for it to complete to prevent its
+	 * work item from being queued up after the cancel_work_sync() below.
 	 */
+	gov_clear_update_util(shared->policy);
+	irq_work_sync(&shared->irq_work);
 	cancel_work_sync(&shared->work);
-	gov_cancel_timers(shared->policy);
 	atomic_set(&shared->skip_work, 0);
 }
-EXPORT_SYMBOL_GPL(gov_cancel_work);
-
-/* Will return if we need to evaluate cpu load again or not */
-static bool need_load_eval(struct cpu_common_dbs_info *shared,
-			   unsigned int sampling_rate)
-{
-	if (policy_is_shared(shared->policy)) {
-		ktime_t time_now = ktime_get();
-		s64 delta_us = ktime_us_delta(time_now, shared->time_stamp);
-
-		/* Do nothing if we recently have sampled */
-		if (delta_us < (s64)(sampling_rate / 2))
-			return false;
-		else
-			shared->time_stamp = time_now;
-	}
-
-	return true;
-}
 
 static void dbs_work_handler(struct work_struct *work)
 {
@@ -234,56 +210,70 @@ static void dbs_work_handler(struct work
 					cpu_common_dbs_info, work);
 	struct cpufreq_policy *policy;
 	struct dbs_data *dbs_data;
-	unsigned int sampling_rate, delay;
-	bool eval_load;
+	unsigned int delay;
 
 	policy = shared->policy;
 	dbs_data = policy->governor_data;
 
-	/* Kill all timers */
-	gov_cancel_timers(policy);
-
-	if (dbs_data->cdata->governor == GOV_CONSERVATIVE) {
-		struct cs_dbs_tuners *cs_tuners = dbs_data->tuners;
-
-		sampling_rate = cs_tuners->sampling_rate;
-	} else {
-		struct od_dbs_tuners *od_tuners = dbs_data->tuners;
-
-		sampling_rate = od_tuners->sampling_rate;
-	}
-
-	eval_load = need_load_eval(shared, sampling_rate);
-
 	/*
-	 * Make sure cpufreq_governor_limits() isn't evaluating load in
-	 * parallel.
+	 * Make sure cpufreq_governor_limits() isn't evaluating load or the
+	 * ondemand governor isn't updating the sampling rate in parallel.
 	 */
 	mutex_lock(&shared->timer_mutex);
-	delay = dbs_data->cdata->gov_dbs_timer(policy, eval_load);
+	delay = dbs_data->cdata->gov_dbs_timer(policy);
+	shared->sample_delay_ns = jiffies_to_nsecs(delay);
 	mutex_unlock(&shared->timer_mutex);
 
+	/*
+	 * If the atomic operation below is reordered with respect to the
+	 * sample delay modification, the utilization update handler may end
+	 * up using a stale sample delay value.
+	 */
+	smp_mb__before_atomic();
 	atomic_dec(&shared->skip_work);
+}
 
-	gov_add_timers(policy, delay);
+static void dbs_irq_work(struct irq_work *irq_work)
+{
+	struct cpu_common_dbs_info *shared;
+
+	shared = container_of(irq_work, struct cpu_common_dbs_info, irq_work);
+	schedule_work(&shared->work);
 }
 
-static void dbs_timer_handler(unsigned long data)
+static inline void gov_queue_irq_work(struct cpu_common_dbs_info *shared)
 {
-	struct cpu_dbs_info *cdbs = (struct cpu_dbs_info *)data;
+#ifdef CONFIG_SMP
+	irq_work_queue_on(&shared->irq_work, smp_processor_id());
+#else
+	irq_work_queue(&shared->irq_work);
+#endif
+}
+
+static void dbs_update_util_handler(struct update_util_data *data, u64 time,
+				    unsigned long util, unsigned long max)
+{
+	struct cpu_dbs_info *cdbs = container_of(data, struct cpu_dbs_info, update_util);
 	struct cpu_common_dbs_info *shared = cdbs->shared;
 
 	/*
-	 * Timer handler may not be allowed to queue the work at the moment,
-	 * because:
-	 * - Another timer handler has done that
-	 * - We are stopping the governor
-	 * - Or we are updating the sampling rate of the ondemand governor
+	 * The work may not be allowed to be queued up right now.
+	 * Possible reasons:
+	 * - Work has already been queued up or is in progress.
+	 * - The governor is being stopped.
+	 * - It is too early (too little time from the previous sample).
 	 */
-	if (atomic_inc_return(&shared->skip_work) > 1)
-		atomic_dec(&shared->skip_work);
-	else
-		queue_work(system_wq, &shared->work);
+	if (atomic_inc_return(&shared->skip_work) == 1) {
+		u64 delta_ns;
+
+		delta_ns = time - shared->last_sample_time;
+		if ((s64)delta_ns >= shared->sample_delay_ns) {
+			shared->last_sample_time = time;
+			gov_queue_irq_work(shared);
+			return;
+		}
+	}
+	atomic_dec(&shared->skip_work);
 }
 
 static void set_sampling_rate(struct dbs_data *dbs_data,
@@ -315,6 +305,7 @@ static int alloc_common_dbs_info(struct
 
 	mutex_init(&shared->timer_mutex);
 	atomic_set(&shared->skip_work, 0);
+	init_irq_work(&shared->irq_work, dbs_irq_work);
 	INIT_WORK(&shared->work, dbs_work_handler);
 	return 0;
 }
@@ -467,9 +458,6 @@ static int cpufreq_governor_start(struct
 		io_busy = od_tuners->io_is_busy;
 	}
 
-	shared->policy = policy;
-	shared->time_stamp = ktime_get();
-
 	for_each_cpu(j, policy->cpus) {
 		struct cpu_dbs_info *j_cdbs = cdata->get_cpu_cdbs(j);
 		unsigned int prev_load;
@@ -485,10 +473,9 @@ static int cpufreq_governor_start(struct
 		if (ignore_nice)
 			j_cdbs->prev_cpu_nice = kcpustat_cpu(j).cpustat[CPUTIME_NICE];
 
-		__setup_timer(&j_cdbs->timer, dbs_timer_handler,
-			      (unsigned long)j_cdbs,
-			      TIMER_DEFERRABLE | TIMER_IRQSAFE);
+		j_cdbs->update_util.func = dbs_update_util_handler;
 	}
+	shared->policy = policy;
 
 	if (cdata->governor == GOV_CONSERVATIVE) {
 		struct cs_cpu_dbs_info_s *cs_dbs_info =
@@ -505,7 +492,7 @@ static int cpufreq_governor_start(struct
 		od_ops->powersave_bias_init_cpu(cpu);
 	}
 
-	gov_add_timers(policy, delay_for_sampling_rate(sampling_rate));
+	gov_set_update_util(shared, sampling_rate);
 	return 0;
 }
 
Index: linux-pm/drivers/cpufreq/cpufreq_ondemand.c
===================================================================
--- linux-pm.orig/drivers/cpufreq/cpufreq_ondemand.c
+++ linux-pm/drivers/cpufreq/cpufreq_ondemand.c
@@ -191,7 +191,7 @@ static void od_check_cpu(int cpu, unsign
 	}
 }
 
-static unsigned int od_dbs_timer(struct cpufreq_policy *policy, bool modify_all)
+static unsigned int od_dbs_timer(struct cpufreq_policy *policy)
 {
 	struct dbs_data *dbs_data = policy->governor_data;
 	unsigned int cpu = policy->cpu;
@@ -200,9 +200,6 @@ static unsigned int od_dbs_timer(struct
 	struct od_dbs_tuners *od_tuners = dbs_data->tuners;
 	int delay = 0, sample_type = dbs_info->sample_type;
 
-	if (!modify_all)
-		goto max_delay;
-
 	/* Common NORMAL_SAMPLE setup */
 	dbs_info->sample_type = OD_NORMAL_SAMPLE;
 	if (sample_type == OD_SUB_SAMPLE) {
@@ -218,7 +215,6 @@ static unsigned int od_dbs_timer(struct
 		}
 	}
 
-max_delay:
 	if (!delay)
 		delay = delay_for_sampling_rate(od_tuners->sampling_rate
 				* dbs_info->rate_mult);
@@ -264,7 +260,6 @@ static void update_sampling_rate(struct
 		struct od_cpu_dbs_info_s *dbs_info;
 		struct cpu_dbs_info *cdbs;
 		struct cpu_common_dbs_info *shared;
-		unsigned long next_sampling, appointed_at;
 
 		dbs_info = &per_cpu(od_cpu_dbs_info, cpu);
 		cdbs = &dbs_info->cdbs;
@@ -288,20 +283,28 @@ static void update_sampling_rate(struct
 		 * policy will be governed by dbs_data, otherwise there can be
 		 * multiple policies that are governed by the same dbs_data.
 		 */
-		if (dbs_data != policy->governor_data)
-			continue;
-
-		/*
-		 * Checking this for any CPU should be fine, timers for all of
-		 * them are scheduled together.
-		 */
-		next_sampling = jiffies + usecs_to_jiffies(new_rate);
-		appointed_at = dbs_info->cdbs.timer.expires;
-
-		if (time_before(next_sampling, appointed_at)) {
-			gov_cancel_work(shared);
-			gov_add_timers(policy, usecs_to_jiffies(new_rate));
-
+		if (dbs_data == policy->governor_data) {
+			mutex_lock(&shared->timer_mutex);
+			/*
+			 * On 32-bit architectures this may race with the
+			 * sample_delay_ns read in dbs_update_util_handler(),
+			 * but that really doesn't matter.  If the read returns
+			 * a value that's too big, the sample will be skipped,
+			 * but the next invocation of dbs_update_util_handler()
+			 * (when the update has been completed) will take a
+			 * sample.  If the returned value is too small, the
+			 * sample will be taken immediately, but that isn't a
+			 * problem, as we want the new rate to take effect
+			 * immediately anyway.
+			 *
+			 * If this runs in parallel with dbs_work_handler(), we
+			 * may end up overwriting the sample_delay_ns value that
+			 * it has just written, but the difference should not be
+			 * too big and it will be corrected next time a sample
+			 * is taken, so it shouldn't be significant.
+			 */
+			gov_update_sample_delay(shared, new_rate);
+			mutex_unlock(&shared->timer_mutex);
 		}
 	}
 
Index: linux-pm/drivers/cpufreq/cpufreq_conservative.c
===================================================================
--- linux-pm.orig/drivers/cpufreq/cpufreq_conservative.c
+++ linux-pm/drivers/cpufreq/cpufreq_conservative.c
@@ -115,14 +115,12 @@ static void cs_check_cpu(int cpu, unsign
 	}
 }
 
-static unsigned int cs_dbs_timer(struct cpufreq_policy *policy, bool modify_all)
+static unsigned int cs_dbs_timer(struct cpufreq_policy *policy)
 {
 	struct dbs_data *dbs_data = policy->governor_data;
 	struct cs_dbs_tuners *cs_tuners = dbs_data->tuners;
 
-	if (modify_all)
-		dbs_check_cpu(dbs_data, policy->cpu);
-
+	dbs_check_cpu(dbs_data, policy->cpu);
 	return delay_for_sampling_rate(cs_tuners->sampling_rate);
 }
 
Index: linux-pm/drivers/cpufreq/Kconfig
===================================================================
--- linux-pm.orig/drivers/cpufreq/Kconfig
+++ linux-pm/drivers/cpufreq/Kconfig
@@ -3,6 +3,7 @@ menu "CPU Frequency scaling"
 config CPU_FREQ
 	bool "CPU Frequency scaling"
 	select SRCU
+	select IRQ_WORK
 	help
 	  CPU Frequency scaling allows you to change the clock speed of 
 	  CPUs on the fly. This is a nice method to save power, because 

^ permalink raw reply	[flat|nested] 134+ messages in thread
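
The control flow this puts in place is easiest to see end to end; roughly,
with the names used in the patch:

        /*
         * Scheduler context (no sleeping, raw locks only):
         *   update_load_avg()
         *     cpufreq_update_util()
         *       dbs_update_util_handler()  checks skip_work and whether
         *                                  sample_delay_ns has elapsed
         *         gov_queue_irq_work()     queues irq_work on the local CPU
         *
         * IRQ context:
         *   dbs_irq_work()
         *     schedule_work(&shared->work)
         *
         * Process context (may sleep, takes timer_mutex):
         *   dbs_work_handler()
         *     gov_dbs_timer()              load evaluation, frequency change
         */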

* Re: [PATCH 0/3] cpufreq: Replace timers with utilization update callbacks
  2016-02-10 14:46                   ` Juri Lelli
@ 2016-02-10 15:46                     ` Rafael J. Wysocki
  2016-02-10 16:05                       ` Juri Lelli
  0 siblings, 1 reply; 134+ messages in thread
From: Rafael J. Wysocki @ 2016-02-10 15:46 UTC (permalink / raw)
  To: Juri Lelli
  Cc: Rafael J. Wysocki, Rafael J. Wysocki, Steve Muckle,
	Peter Zijlstra, Linux PM list, Linux Kernel Mailing List,
	Srinivas Pandruvada, Viresh Kumar, Thomas Gleixner

On Wed, Feb 10, 2016 at 3:46 PM, Juri Lelli <juri.lelli@arm.com> wrote:
> On 10/02/16 15:26, Rafael J. Wysocki wrote:
>> On Wed, Feb 10, 2016 at 3:03 PM, Juri Lelli <juri.lelli@arm.com> wrote:
>> > On 10/02/16 14:23, Rafael J. Wysocki wrote:
>> >> On Wed, Feb 10, 2016 at 1:33 PM, Juri Lelli <juri.lelli@arm.com> wrote:
>> >> > Hi Rafael,
>> >> >
>> >> > On 09/02/16 21:05, Rafael J. Wysocki wrote:
>> >> >
>> >> > [...]
>> >> >
>> >> >> +/**
>> >> >> + * cpufreq_update_util - Take a note about CPU utilization changes.
>> >> >> + * @util: Current utilization.
>> >> >> + * @max: Utilization ceiling.
>> >> >> + *
>> >> >> + * This function is called by the scheduler on every invocation of
>> >> >> + * update_load_avg() on the CPU whose utilization is being updated.
>> >> >> + */
>> >> >> +void cpufreq_update_util(unsigned long util, unsigned long max)
>> >> >> +{
>> >> >> +     struct update_util_data *data;
>> >> >> +
>> >> >> +     rcu_read_lock();
>> >> >> +
>> >> >> +     data = rcu_dereference(*this_cpu_ptr(&cpufreq_update_util_data));
>> >> >> +     if (data && data->func)
>> >> >> +             data->func(data, cpu_clock(smp_processor_id()), util, max);
>> >> >
>> >> > Are util and max used anywhere?
>> >>
>> >> They aren't yet, but they will be.
>> >>
>> >> Maybe not in this cycle (if it takes too much time to integrate the
>> >> preliminary changes), but we definitely are going to use those
>> >> numbers.
>> >>
>> >
>> > Oh OK. However, I was under the impression that this set was only
>> > proposing a way to get rid of timers and use the scheduler as heartbeat
>> > for cpufreq governors. The governors' sample based approach wouldn't
>> > change, though. Am I wrong in assuming this?
>>
>> Your assumption is correct.
>>
>
> In this case, wouldn't it be possible to simply put the kicks in
> sched/core.c? scheduler_tick() seems a good candidate for that, and you
> could complement that with enqueue/dequeue/etc., if needed.

That can be done, but they are not needed for things like idle and
stop, are they?

> I'm actually wondering if a slow CONFIG_HZ might affect governors'
> sampling rate. We might have the scheduler tick firing every 40ms and the
> sampling rate set to 10 or 20ms, mightn't we?

The smallest HZ you can get from the standard config is 100.  That
would translate to an update every 10ms roughly if my understanding of
things is correct.

Also I think that the scheduler and cpufreq should really work at the
same pace as they affect each other in any case.

>> The sample-based approach doesn't change at this time, simply to avoid
>> making too many changes in one go.
>>
>> The next step, as I'm seeing it, would be to use the
>> scheduler-provided utilization in the governor computations instead of
>> the load estimation made by governors themselves.
>>
>
> > OK. But I'm not sure what this buys us. If the end goal is still to
> > do sampling, aren't we better off using the (1 - idle) estimation, as
> > we do today?

First of all, we can avoid the need to compute this number entirely if
we use the scheduler-provided one.

Second, what if we come up with a different idea about the CPU
utilization than the scheduler has?  Who's right then?

Finally, the way this number is currently computed by cpufreq is based
on some questionable heuristics (and not just in one place), so maybe
it's better to stop doing that?

Also I didn't say that the *final* goal would be to do sampling.  I
was talking about the next step. :-)

Thanks,
Rafael

^ permalink raw reply	[flat|nested] 134+ messages in thread
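
For the record, the contrast discussed above is roughly the following; the
first line is a simplification of what dbs_check_cpu() reconstructs from
idle-time deltas today, the second is what the scheduler passes to
cpufreq_update_util() in this series:

        /*
         * cpufreq governors today: busy fraction of the last sampling
         * window, derived from idle-time accounting.
         */
        load = 100 * (wall_time - idle_time) / wall_time;

        /*
         * This series: the PELT utilization signal the scheduler already
         * maintains, clamped to the CPU's original capacity.
         */
        util = min(cfs_rq->avg.util_avg, rq->cpu_capacity_orig);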

* Re: [PATCH 0/3] cpufreq: Replace timers with utilization update callbacks
  2016-02-10 15:46                     ` Rafael J. Wysocki
@ 2016-02-10 16:05                       ` Juri Lelli
  0 siblings, 0 replies; 134+ messages in thread
From: Juri Lelli @ 2016-02-10 16:05 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Rafael J. Wysocki, Steve Muckle, Peter Zijlstra, Linux PM list,
	Linux Kernel Mailing List, Srinivas Pandruvada, Viresh Kumar,
	Thomas Gleixner

On 10/02/16 16:46, Rafael J. Wysocki wrote:
> On Wed, Feb 10, 2016 at 3:46 PM, Juri Lelli <juri.lelli@arm.com> wrote:
> > On 10/02/16 15:26, Rafael J. Wysocki wrote:
> >> On Wed, Feb 10, 2016 at 3:03 PM, Juri Lelli <juri.lelli@arm.com> wrote:
> >> > On 10/02/16 14:23, Rafael J. Wysocki wrote:
> >> >> On Wed, Feb 10, 2016 at 1:33 PM, Juri Lelli <juri.lelli@arm.com> wrote:
> >> >> > Hi Rafael,
> >> >> >
> >> >> > On 09/02/16 21:05, Rafael J. Wysocki wrote:
> >> >> >
> >> >> > [...]
> >> >> >
> >> >> >> +/**
> >> >> >> + * cpufreq_update_util - Take a note about CPU utilization changes.
> >> >> >> + * @util: Current utilization.
> >> >> >> + * @max: Utilization ceiling.
> >> >> >> + *
> >> >> >> + * This function is called by the scheduler on every invocation of
> >> >> >> + * update_load_avg() on the CPU whose utilization is being updated.
> >> >> >> + */
> >> >> >> +void cpufreq_update_util(unsigned long util, unsigned long max)
> >> >> >> +{
> >> >> >> +     struct update_util_data *data;
> >> >> >> +
> >> >> >> +     rcu_read_lock();
> >> >> >> +
> >> >> >> +     data = rcu_dereference(*this_cpu_ptr(&cpufreq_update_util_data));
> >> >> >> +     if (data && data->func)
> >> >> >> +             data->func(data, cpu_clock(smp_processor_id()), util, max);
> >> >> >
> >> >> > Are util and max used anywhere?
> >> >>
> >> >> They aren't yet, but they will be.
> >> >>
> >> >> Maybe not in this cycle (if it takes too much time to integrate the
> >> >> preliminary changes), but we definitely are going to use those
> >> >> numbers.
> >> >>
> >> >
> >> > Oh OK. However, I was under the impression that this set was only
> >> > proposing a way to get rid of timers and use the scheduler as heartbeat
> >> > for cpufreq governors. The governors' sample based approach wouldn't
> >> > change, though. Am I wrong in assuming this?
> >>
> >> Your assumption is correct.
> >>
> >
> > In this case, wouldn't it be possible to simply put the kicks in
> > sched/core.c? scheduler_tick() seems a good candidate for that, and you
> > could complement that with enqueue/dequeue/etc., if needed.
> 
> That can be done, but they are not needed for things like idle and
> stop, are they?
> 

Sorry, I'm not sure I understand you here. In a NO_HZ system the tick will
be stopped when idle.

> > I'm actually wondering if a slow CONFIG_HZ might affect governors'
> > sampling rate. We might have the scheduler tick firing every 40ms and the
> > sampling rate set to 10 or 20ms, mightn't we?
> 
> The smallest HZ you can get from the standard config is 100.  That
> would translate to an update every 10ms roughly if my understanding of
> things is correct.
> 

Right. Please, forget my question above :).

> Also I think that the scheduler and cpufreq should really work at the
> same pace as they affect each other in any case.
> 

Makes sense yes.

> >> The sample-based approach doesn't change at this time, simply to avoid
> >> making too many changes in one go.
> >>
> >> The next step, as I'm seeing it, would be to use the
> >> scheduler-provided utilization in the governor computations instead of
> >> the load estimation made by governors themselves.
> >>
> >
> > OK. But I'm not sure what this buys us. If the end goal is still to
> > do sampling, aren't we better off using the (1 - idle) estimation, as
> > we do today?
> 
> First of all, we can avoid the need to compute this number entirely if
> we use the scheduler-provided one.
> 
> Second, what if we come up with a different idea about the CPU
> utilization than the scheduler has?  Who's right then?
> 
> Finally, the way this number is currently computed by cpufreq is based
> on some questionable heuristics (and not just in one place), so maybe
> it's better to stop doing that?
> 
> Also I didn't say that the *final* goal would be to do sampling.  I
> was talking about the next step. :-)
> 

Oh, this changes things indeed. :)

Thanks,

- Juri

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH 0/3] cpufreq: Replace timers with utilization update callbacks
  2016-02-10  3:09               ` Rafael J. Wysocki
@ 2016-02-10 19:47                 ` Steve Muckle
  2016-02-10 21:49                   ` Rafael J. Wysocki
  0 siblings, 1 reply; 134+ messages in thread
From: Steve Muckle @ 2016-02-10 19:47 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Rafael J. Wysocki, Peter Zijlstra, Linux PM list,
	Linux Kernel Mailing List, Srinivas Pandruvada, Viresh Kumar,
	Juri Lelli, Thomas Gleixner

On 02/09/2016 07:09 PM, Rafael J. Wysocki wrote:
>>> >> I think additional hooks such as enqueue/dequeue would be needed in
>>> >> RT/DL. The task tick callbacks will only run if a task in that class is
>>> >> executing at the time of the tick. There could be intermittent RT/DL
>>> >> task activity in a frequency domain (the only task activity there, no
>>> >> CFS tasks) that doesn't happen to overlap the tick. Worst case the task
>>> >> activity could be periodic in such a way that it never overlaps the tick
>>> >> and the update is never made.
>> >
>> > So if I'm reading this correctly, it would be better to put the hooks
>> > into update_curr_rt/dl()?

That should AFAICS be sufficient to avoid stalling. It may be more than
is required as that covers more than just enqueue/dequeue but I'm not
sure offhand.

>
> If done this way, I guess we may pass rq_clock_task(rq) as the time
> arg to cpufreq_update_util() from there and then the cpu_lock() call
> I've added to this prototype won't be necessary any more.

Is it rq_clock_task() or rq_clock()? The former can omit irq time so may
gradually fall behind wall clock time, delaying callbacks in cpufreq.

thanks,
Steve

^ permalink raw reply	[flat|nested] 134+ messages in thread
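
For reference, the two accessors under discussion are defined in
kernel/sched/sched.h and differ only in which per-rq clock they return;
rq->clock_task has IRQ (and steal) time subtracted when the relevant
accounting options are enabled, which is why it can fall behind rq->clock:

        static inline u64 rq_clock(struct rq *rq)
        {
                return rq->clock;
        }

        static inline u64 rq_clock_task(struct rq *rq)
        {
                return rq->clock_task;
        }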

* Re: [PATCH 0/3] cpufreq: Replace timers with utilization update callbacks
  2016-02-10 19:47                 ` Steve Muckle
@ 2016-02-10 21:49                   ` Rafael J. Wysocki
  2016-02-10 22:07                     ` Steve Muckle
  0 siblings, 1 reply; 134+ messages in thread
From: Rafael J. Wysocki @ 2016-02-10 21:49 UTC (permalink / raw)
  To: Steve Muckle
  Cc: Rafael J. Wysocki, Rafael J. Wysocki, Peter Zijlstra,
	Linux PM list, Linux Kernel Mailing List, Srinivas Pandruvada,
	Viresh Kumar, Juri Lelli, Thomas Gleixner

On Wed, Feb 10, 2016 at 8:47 PM, Steve Muckle <steve.muckle@linaro.org> wrote:
> On 02/09/2016 07:09 PM, Rafael J. Wysocki wrote:
>>>> >> I think additional hooks such as enqueue/dequeue would be needed in
>>>> >> RT/DL. The task tick callbacks will only run if a task in that class is
>>>> >> executing at the time of the tick. There could be intermittent RT/DL
>>>> >> task activity in a frequency domain (the only task activity there, no
>>>> >> CFS tasks) that doesn't happen to overlap the tick. Worst case the task
>>>> >> activity could be periodic in such a way that it never overlaps the tick
>>>> >> and the update is never made.
>>> >
>>> > So if I'm reading this correctly, it would be better to put the hooks
>>> > into update_curr_rt/dl()?
>
> That should AFAICS be sufficient to avoid stalling. It may be more than
> is required as that covers more than just enqueue/dequeue but I'm not
> sure offhand.
>
>>
>> If done this way, I guess we may pass rq_clock_task(rq) as the time
>> arg to cpufreq_update_util() from there and then the cpu_lock() call
>> I've added to this prototype won't be necessary any more.
>
> Is it rq_clock_task() or rq_clock()? The former can omit irq time so may
> gradually fall behind wall clock time, delaying callbacks in cpufreq.

What matters to us is the difference between the current time and the
time we previously took a sample and there shouldn't be too much
difference between the two in that respect.

Both are good enough IMO, but I can update the patch to use rq_clock()
if that's preferred.

Thanks,
Rafael

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH 0/3] cpufreq: Replace timers with utilization update callbacks
  2016-02-10 21:49                   ` Rafael J. Wysocki
@ 2016-02-10 22:07                     ` Steve Muckle
  2016-02-10 22:12                       ` Rafael J. Wysocki
  0 siblings, 1 reply; 134+ messages in thread
From: Steve Muckle @ 2016-02-10 22:07 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Rafael J. Wysocki, Peter Zijlstra, Linux PM list,
	Linux Kernel Mailing List, Srinivas Pandruvada, Viresh Kumar,
	Juri Lelli, Thomas Gleixner

On 02/10/2016 01:49 PM, Rafael J. Wysocki wrote:
>>> If done this way, I guess we may pass rq_clock_task(rq) as the time
>>> >> arg to cpufreq_update_util() from there and then the cpu_lock() call
>>> >> I've added to this prototype won't be necessary any more.
>> >
>> > Is it rq_clock_task() or rq_clock()? The former can omit irq time so may
>> > gradually fall behind wall clock time, delaying callbacks in cpufreq.
>
> What matters to us is the difference between the current time and the
> time we previously took a sample and there shouldn't be too much
> difference between the two in that respect.

Sorry, the reference to wall clock time was unnecessary. I just meant it
can lose time, which could cause cpufreq updates to be delayed during
irq heavy periods.

> Both are good enough IMO, but I can update the patch to use rq_clock()
> if that's preferred.

I do believe rq_clock() should be used, as workloads such as heavy
networking could spend a significant portion of time in interrupts,
skewing rq_clock_task() significantly, assuming I understand it correctly.

thanks,
Steve

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH 0/3] cpufreq: Replace timers with utilization update callbacks
  2016-02-10 22:07                     ` Steve Muckle
@ 2016-02-10 22:12                       ` Rafael J. Wysocki
  0 siblings, 0 replies; 134+ messages in thread
From: Rafael J. Wysocki @ 2016-02-10 22:12 UTC (permalink / raw)
  To: Steve Muckle
  Cc: Rafael J. Wysocki, Rafael J. Wysocki, Peter Zijlstra,
	Linux PM list, Linux Kernel Mailing List, Srinivas Pandruvada,
	Viresh Kumar, Juri Lelli, Thomas Gleixner

On Wed, Feb 10, 2016 at 11:07 PM, Steve Muckle <steve.muckle@linaro.org> wrote:
> On 02/10/2016 01:49 PM, Rafael J. Wysocki wrote:
>>>> If done this way, I guess we may pass rq_clock_task(rq) as the time
>>>> >> arg to cpufreq_update_util() from there and then the cpu_lock() call
>>>> >> I've added to this prototype won't be necessary any more.
>>> >
>>> > Is it rq_clock_task() or rq_clock()? The former can omit irq time so may
>>> > gradually fall behind wall clock time, delaying callbacks in cpufreq.
>>
>> What matters to us is the difference between the current time and the
>> time we previously took a sample and there shouldn't be too much
>> difference between the two in that respect.
>
> Sorry, the reference to wall clock time was unnecessary. I just meant it
> can lose time, which could cause cpufreq updates to be delayed during
> irq heavy periods.
>
>> Both are good enough IMO, but I can update the patch to use rq_clock()
>> if that's preferred.
>
> I do believe rq_clock() should be used, as workloads such as heavy
> networking could spend a significant portion of time in interrupts,
> skewing rq_clock_task() significantly, assuming I understand it correctly.

OK, I'll send an update, then.

Thanks,
Rafael

^ permalink raw reply	[flat|nested] 134+ messages in thread

* [PATCH v7 1/3] cpufreq: Add mechanism for registering utilization update callbacks
  2016-02-10 15:21   ` [PATCH v6 1/3] cpufreq: Add mechanism for registering " Rafael J. Wysocki
@ 2016-02-10 23:01     ` Rafael J. Wysocki
  2016-02-11 17:30       ` [PATCH v8 " Rafael J. Wysocki
  0 siblings, 1 reply; 134+ messages in thread
From: Rafael J. Wysocki @ 2016-02-10 23:01 UTC (permalink / raw)
  To: Linux PM list, Ingo Molnar
  Cc: Linux Kernel Mailing List, Peter Zijlstra, Srinivas Pandruvada,
	Viresh Kumar, Juri Lelli, Steve Muckle, Thomas Gleixner

From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

Introduce a mechanism by which parts of the cpufreq subsystem
("setpolicy" drivers or the core) can register callbacks to be
executed from cpufreq_update_util() which is invoked by the
scheduler's update_load_avg() on CPU utilization changes.

This allows the "setpolicy" drivers to dispense with their timers
and do all of the computations they need, as well as frequency/voltage
adjustments, in the update_load_avg() code path, among other things.

The update_load_avg() changes were suggested by Peter Zijlstra.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
---

Changes from v6:
- Steve suggested using rq_clock() instead of rq_clock_task() as the time
  argument for cpufreq_update_util() as that seems to be more suitable for
  this purpose.

Thanks,
Rafael

---
 drivers/cpufreq/cpufreq.c |   45 +++++++++++++++++++++++++++++++++++++++++++++
 include/linux/cpufreq.h   |   17 +++++++++++++++++
 kernel/sched/deadline.c   |    3 +++
 kernel/sched/fair.c       |   26 +++++++++++++++++++++++++-
 kernel/sched/rt.c         |    3 +++
 kernel/sched/sched.h      |    1 +
 6 files changed, 94 insertions(+), 1 deletion(-)

Index: linux-pm/include/linux/cpufreq.h
===================================================================
--- linux-pm.orig/include/linux/cpufreq.h
+++ linux-pm/include/linux/cpufreq.h
@@ -151,6 +151,19 @@ static inline bool policy_is_shared(stru
 extern struct kobject *cpufreq_global_kobject;
 
 #ifdef CONFIG_CPU_FREQ
+void cpufreq_update_util(u64 time, unsigned long util, unsigned long max);
+static inline void cpufreq_trigger_update(u64 time)
+{
+	cpufreq_update_util(time, ULONG_MAX, 0);
+}
+
+struct update_util_data {
+	void (*func)(struct update_util_data *data,
+		     u64 time, unsigned long util, unsigned long max);
+};
+
+void cpufreq_set_update_util_data(int cpu, struct update_util_data *data);
+
 unsigned int cpufreq_get(unsigned int cpu);
 unsigned int cpufreq_quick_get(unsigned int cpu);
 unsigned int cpufreq_quick_get_max(unsigned int cpu);
@@ -162,6 +175,10 @@ int cpufreq_update_policy(unsigned int c
 bool have_governor_per_policy(void);
 struct kobject *get_governor_parent_kobj(struct cpufreq_policy *policy);
 #else
+static inline void cpufreq_update_util(u64 time, unsigned long util,
+				       unsigned long max) {}
+static inline void cpufreq_trigger_update(u64 time) {}
+
 static inline unsigned int cpufreq_get(unsigned int cpu)
 {
 	return 0;
Index: linux-pm/kernel/sched/sched.h
===================================================================
--- linux-pm.orig/kernel/sched/sched.h
+++ linux-pm/kernel/sched/sched.h
@@ -9,6 +9,7 @@
 #include <linux/irq_work.h>
 #include <linux/tick.h>
 #include <linux/slab.h>
+#include <linux/cpufreq.h>
 
 #include "cpupri.h"
 #include "cpudeadline.h"
Index: linux-pm/kernel/sched/fair.c
===================================================================
--- linux-pm.orig/kernel/sched/fair.c
+++ linux-pm/kernel/sched/fair.c
@@ -2824,7 +2824,8 @@ static inline void update_load_avg(struc
 {
 	struct cfs_rq *cfs_rq = cfs_rq_of(se);
 	u64 now = cfs_rq_clock_task(cfs_rq);
-	int cpu = cpu_of(rq_of(cfs_rq));
+	struct rq *rq = rq_of(cfs_rq);
+	int cpu = cpu_of(rq);
 
 	/*
 	 * Track task load average for carrying it to new CPU after migrated, and
@@ -2836,6 +2837,29 @@ static inline void update_load_avg(struc
 
 	if (update_cfs_rq_load_avg(now, cfs_rq) && update_tg)
 		update_tg_load_avg(cfs_rq, 0);
+
+	if (cpu == smp_processor_id() && &rq->cfs == cfs_rq) {
+		unsigned long max = rq->cpu_capacity_orig;
+
+		/*
+		 * There are a few boundary cases this might miss but it should
+		 * get called often enough that that should (hopefully) not be
+		 * a real problem -- added to that it only calls on the local
+		 * CPU, so if we enqueue remotely we'll miss an update, but
+		 * the next tick/schedule should update.
+		 *
+		 * It will not get called when we go idle, because the idle
+		 * thread is a different class (!fair), nor will the utilization
+		 * number include things like RT tasks.
+		 *
+		 * As is, the util number is not freq-invariant (we'd have to
+		 * implement arch_scale_freq_capacity() for that).
+		 *
+		 * See cpu_util().
+		 */
+		cpufreq_update_util(rq_clock(rq),
+				    min(cfs_rq->avg.util_avg, max), max);
+	}
 }
 
 static void attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
Index: linux-pm/kernel/sched/deadline.c
===================================================================
--- linux-pm.orig/kernel/sched/deadline.c
+++ linux-pm/kernel/sched/deadline.c
@@ -726,6 +726,9 @@ static void update_curr_dl(struct rq *rq
 	if (!dl_task(curr) || !on_dl_rq(dl_se))
 		return;
 
+	/* Kick a cpufreq update to prevent it from stalling. */
+	cpufreq_trigger_update(rq_clock(rq));
+
 	/*
 	 * Consumed budget is computed considering the time as
 	 * observed by schedulable tasks (excluding time spent
Index: linux-pm/kernel/sched/rt.c
===================================================================
--- linux-pm.orig/kernel/sched/rt.c
+++ linux-pm/kernel/sched/rt.c
@@ -949,6 +949,9 @@ static void update_curr_rt(struct rq *rq
 	if (unlikely((s64)delta_exec <= 0))
 		return;
 
+	/* Kick a cpufreq update to prevent it from stalling. */
+	cpufreq_trigger_update(rq_clock(rq));
+
 	schedstat_set(curr->se.statistics.exec_max,
 		      max(curr->se.statistics.exec_max, delta_exec));
 
Index: linux-pm/drivers/cpufreq/cpufreq.c
===================================================================
--- linux-pm.orig/drivers/cpufreq/cpufreq.c
+++ linux-pm/drivers/cpufreq/cpufreq.c
@@ -102,6 +102,51 @@ static LIST_HEAD(cpufreq_governor_list);
 static struct cpufreq_driver *cpufreq_driver;
 static DEFINE_PER_CPU(struct cpufreq_policy *, cpufreq_cpu_data);
 static DEFINE_RWLOCK(cpufreq_driver_lock);
+
+static DEFINE_PER_CPU(struct update_util_data *, cpufreq_update_util_data);
+
+/**
+ * cpufreq_set_update_util_data - Populate the CPU's update_util_data pointer.
+ * @cpu: The CPU to set the pointer for.
+ * @data: New pointer value.
+ *
+ * Set and publish the update_util_data pointer for the given CPU.  That pointer
+ * points to a struct update_util_data object containing a callback function
+ * to call from cpufreq_update_util().  That function will be called from an RCU
+ * read-side critical section, so it must not sleep.
+ *
+ * Callers must use RCU callbacks to free any memory that might be accessed
+ * via the old update_util_data pointer or invoke synchronize_rcu() right after
+ * this function to avoid use-after-free.
+ */
+void cpufreq_set_update_util_data(int cpu, struct update_util_data *data)
+{
+	rcu_assign_pointer(per_cpu(cpufreq_update_util_data, cpu), data);
+}
+EXPORT_SYMBOL_GPL(cpufreq_set_update_util_data);
+
+/**
+ * cpufreq_update_util - Take a note about CPU utilization changes.
+ * @time: Current time.
+ * @util: Current utilization.
+ * @max: Utilization ceiling.
+ *
+ * This function is called by the scheduler on every invocation of
+ * update_load_avg() on the CPU whose utilization is being updated.
+ */
+void cpufreq_update_util(u64 time, unsigned long util, unsigned long max)
+{
+	struct update_util_data *data;
+
+	rcu_read_lock();
+
+	data = rcu_dereference(*this_cpu_ptr(&cpufreq_update_util_data));
+	if (data && data->func)
+		data->func(data, time, util, max);
+
+	rcu_read_unlock();
+}
+
 DEFINE_MUTEX(cpufreq_governor_lock);
 
 /* Flag to suspend/resume CPUFreq governors */
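
As a usage illustration (not part of the patch; the names below are made
up, patch [2/3] has the real thing), a driver would embed struct
update_util_data in its per-CPU data, register it, and tear it down as the
kerneldoc above describes:

static struct my_driver_cpu {
        struct update_util_data update_util;
        /* ... driver-private state ... */
} my_driver_cpu_data;

static DEFINE_PER_CPU(struct my_driver_cpu, my_driver_cpu_data);

/* Runs under rcu_read_lock() from scheduler context, so it must not sleep. */
static void my_driver_update_util(struct update_util_data *data, u64 time,
                                  unsigned long util, unsigned long max)
{
        struct my_driver_cpu *c = container_of(data, struct my_driver_cpu,
                                               update_util);

        /* ... sample and adjust the P-state for this CPU ... */
}

static void my_driver_start_cpu(int cpu)
{
        struct my_driver_cpu *c = &per_cpu(my_driver_cpu_data, cpu);

        c->update_util.func = my_driver_update_util;
        cpufreq_set_update_util_data(cpu, &c->update_util);
}

static void my_driver_stop_cpu(int cpu)
{
        cpufreq_set_update_util_data(cpu, NULL);
        synchronize_rcu();      /* wait for in-flight callbacks to finish */
}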

^ permalink raw reply	[flat|nested] 134+ messages in thread

* RE: [PATCH v6 0/3] cpufreq: Replace timers with utilization update callbacks
  2016-02-10 15:17 ` [PATCH v6 " Rafael J. Wysocki
                     ` (2 preceding siblings ...)
  2016-02-10 15:36   ` [PATCH v6 3/3] cpufreq: governor: " Rafael J. Wysocki
@ 2016-02-10 23:11   ` Doug Smythies
  2016-02-10 23:17     ` Rafael J. Wysocki
  2016-02-11  6:02     ` Srinivas Pandruvada
  3 siblings, 2 replies; 134+ messages in thread
From: Doug Smythies @ 2016-02-10 23:11 UTC (permalink / raw)
  To: 'Rafael J. Wysocki', 'Linux PM list',
	'Ingo Molnar'
  Cc: 'Linux Kernel Mailing List', 'Peter Zijlstra',
	'Srinivas Pandruvada', 'Viresh Kumar',
	'Juri Lelli', 'Steve Muckle',
	'Thomas Gleixner'

On 2016.02.10 07:17 Rafael J. Wysocki wrote:
> On Friday, January 29, 2016 11:52:15 PM Rafael J. Wysocki wrote:
>>
>> The following patch series introduces a mechanism allowing the cpufreq core
>> and "setpolicy" drivers to provide utilization update callbacks to be invoked
>> by the scheduler on utilization changes.  Those callbacks can be used to run
>> the sampling and frequency adjustments code (intel_pstate) or to schedule the
>> execution of that code in process context (cpufreq core) instead of per-CPU
>> deferrable timers used in cpufreq today (which Thomas complained about during
>> the last Kernel Summit).

This patch set solves a long-standing issue with the intel_pstate driver.
The issue began with the introduction of the "duration" method for deciding
if the CPU had been idle for a long time resulting in forcing the
target pstate downwards. Often this was the correct action, but sometimes this
was the wrong thing to do, because the cpu was actually very busy, but just so
happened to be idle on jiffy boundaries (perhaps similar to what Steve Muckle
was referring to on another branch of this thread).

For an idle system, this patch set seems to change the maximum duration from
4 seconds to 0.5 seconds for most CPUs. However, when using v1 of patches 1
and 2 of 3 and v5 of 3 of 3, durations (time between passes of the
intel_pstate driver for a given CPU) of upwards of 120 seconds were sometimes
observed. When patches 1, 2, and 3 of 3 v6 were used, the maximum observed
durations on an idle system were on the order of 500 milliseconds for most
CPUs, but CPU 6 sometimes went to 3.5 seconds and CPU 7 sometimes went to
4 seconds (small sample space; I'll consider running an overnight test for a
much, much larger sample space). Note that 4 seconds is O.K. and is what it
was before; I'm just noting it.

I have a bunch of graphs, if anyone wants to see the supporting data.

My test computer has an older model i7 (Intel(R) Core(TM) i7-2600K CPU @ 3.40GHz)

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH v6 0/3] cpufreq: Replace timers with utilization update callbacks
  2016-02-10 23:11   ` [PATCH v6 0/3] cpufreq: " Doug Smythies
@ 2016-02-10 23:17     ` Rafael J. Wysocki
  2016-02-11 22:50       ` Doug Smythies
  2016-02-11  6:02     ` Srinivas Pandruvada
  1 sibling, 1 reply; 134+ messages in thread
From: Rafael J. Wysocki @ 2016-02-10 23:17 UTC (permalink / raw)
  To: Doug Smythies
  Cc: 'Linux PM list', 'Ingo Molnar',
	'Linux Kernel Mailing List', 'Peter Zijlstra',
	'Srinivas Pandruvada', 'Viresh Kumar',
	'Juri Lelli', 'Steve Muckle',
	'Thomas Gleixner'

On Wednesday, February 10, 2016 03:11:43 PM Doug Smythies wrote:
> On 2016.02.10 07:17 Rafael J. Wysocki wrote:
> > On Friday, January 29, 2016 11:52:15 PM Rafael J. Wysocki wrote:
> >>
> >> The following patch series introduces a mechanism allowing the cpufreq core
> >> and "setpolicy" drivers to provide utilization update callbacks to be invoked
> >> by the scheduler on utilization changes.  Those callbacks can be used to run
> >> the sampling and frequency adjustments code (intel_pstate) or to schedule the
> >> execution of that code in process context (cpufreq core) instead of per-CPU
> >> deferrable timers used in cpufreq today (which Thomas complained about during
> >> the last Kernel Summit).
> 
> This patch set solves a long-standing issue with the intel_pstate driver.

Good to hear that, thanks!

> The issue began with the introduction of the "duration" method for deciding
> if the CPU had been idle for a long time resulting in forcing the
> target pstate downwards. Often this was the correct action, but sometimes this
> was the wrong thing to do, because the cpu was actually very busy, but just so
> happened to be idle on jiffy boundaries (perhaps similar to what Steve Muckle
> was referring to on another branch of this thread).
> 
> For an idle system, this patch set seems to change the maximum duration from
> 4 seconds to 0.5 seconds for most CPUs. However, when using v1 of patches 1
> and 2 of 3 and v5 of 3 of 3, durations (time between passes of the
> intel_pstate driver for a given CPU) of upwards of 120 seconds were sometimes
> observed. When patches 1, 2, and 3 of 3 v6 were used, the maximum observed
> durations on an idle system were on the order of 500 milliseconds for most
> CPUs, but CPU 6 sometimes went to 3.5 seconds and CPU 7 sometimes went to
> 4 seconds (small sample space; I'll consider running an overnight test for a
> much, much larger sample space). Note that 4 seconds is O.K. and is what it
> was before; I'm just noting it.
> 
> I have a bunch of graphs, if anyone wants to see the supporting data.

It would be good to see how the data with and without the patchset compare
to each other if you have that.

Thanks,
Rafael

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH v6 0/3] cpufreq: Replace timers with utilization update callbacks
  2016-02-10 23:11   ` [PATCH v6 0/3] cpufreq: " Doug Smythies
  2016-02-10 23:17     ` Rafael J. Wysocki
@ 2016-02-11  6:02     ` Srinivas Pandruvada
  1 sibling, 0 replies; 134+ messages in thread
From: Srinivas Pandruvada @ 2016-02-11  6:02 UTC (permalink / raw)
  To: Doug Smythies, 'Rafael J. Wysocki',
	'Linux PM list', 'Ingo Molnar'
  Cc: 'Linux Kernel Mailing List', 'Peter Zijlstra',
	'Viresh Kumar', 'Juri Lelli',
	'Steve Muckle', 'Thomas Gleixner'



On 02/10/2016 03:11 PM, Doug Smythies wrote:
> On 2016.02.10 07:17 Rafael J. Wysocki wrote:
>> On Friday, January 29, 2016 11:52:15 PM Rafael J. Wysocki wrote:
>>> The following patch series introduces a mechanism allowing the cpufreq core
>>> and "setpolicy" drivers to provide utilization update callbacks to be invoked
>>> by the scheduler on utilization changes.  Those callbacks can be used to run
>>> the sampling and frequency adjustments code (intel_pstate) or to schedule the
>>> execution of that code in process context (cpufreq core) instead of per-CPU
>>> deferrable timers used in cpufreq today (which Thomas complained about during
>>> the last Kernel Summit).
> This patch set solves a long-standing issue with the intel_pstate driver.
> The issue began with the introduction of the "duration" method for deciding
> if the CPU had been idle for a long time resulting in forcing the
> target pstate downwards. Often this was the correct action, but sometimes this
> was the wrong thing to do, because the cpu was actually very busy, but just so
> happened to be idle on jiffy boundaries (perhaps similar to what Steve Muckle
> was referring to on another branch of this thread).
>
> For an idle system, this patch set seems to change the maximum duration from
> 4 seconds to 0.5 seconds for most CPUs. However, when using v1 of patches 1
> and 2 of 3 and v5 of 3 of 3, durations (time between passes of the
> intel_pstate driver for a given CPU) of upwards of 120 seconds were sometimes
> observed. When patches 1, 2, and 3 of 3 v6 were used, the maximum observed
> durations on an idle system were on the order of 500 milliseconds for most
> CPUs, but CPU 6 sometimes went to 3.5 seconds and CPU 7 sometimes went to
> 4 seconds (small sample space; I'll consider running an overnight test for a
> much, much larger sample space). Note that 4 seconds is O.K. and is what it
> was before; I'm just noting it.
>
> I have a bunch of graphs, if anyone wants to see the supporting data.
>
> My test computer has an older model i7 (Intel(R) Core(TM) i7-2600K CPU @ 3.40GHz)
Thanks Doug. If you have specific workloads, please compare performance.

- Srinivas

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH 0/3] cpufreq: Replace timers with utilization update callbacks
  2016-02-09 20:05         ` Rafael J. Wysocki
  2016-02-10  1:02           ` Steve Muckle
  2016-02-10 12:33           ` Juri Lelli
@ 2016-02-11 11:51           ` Peter Zijlstra
  2016-02-11 12:08             ` Rafael J. Wysocki
  2 siblings, 1 reply; 134+ messages in thread
From: Peter Zijlstra @ 2016-02-11 11:51 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Steve Muckle, Rafael J. Wysocki, Linux PM list,
	Linux Kernel Mailing List, Srinivas Pandruvada, Viresh Kumar,
	Juri Lelli, Thomas Gleixner

On Tue, Feb 09, 2016 at 09:05:05PM +0100, Rafael J. Wysocki wrote:
> > > One concern I had was, given that the lone scheduler update hook is in
> > > CFS, is it possible for governor updates to be stalled due to RT or DL
> > > task activity?
> > 
> > I don't think they may be completely stalled, but I'd prefer Peter to
> > answer that as he suggested to do it this way.
> 
> In any case, if that concern turns out to be significant in practice, it may
> be addressed like in the appended modification of patch [1/3] from the $subject
> series.
> 
> With that things look like before from the cpufreq side, but the other sched
> classes also get a chance to trigger a cpufreq update.  The drawback is the
> cpu_clock() call instead of passing the time value from update_load_avg(), but
> I guess we can live with that if necessary.
> 
> FWIW, this modification doesn't seem to break things on my test machine.

Not really pretty though. It blows a bit that you require this callback
to be periodic (in order to replace a timer).

Ideally we'd not have to call this if state doesn't change.


> +++ linux-pm/include/linux/sched.h
> @@ -3207,4 +3207,11 @@ static inline unsigned long rlimit_max(u
>  	return task_rlimit_max(current, limit);
>  }
>  
> +void cpufreq_update_util(unsigned long util, unsigned long max);

Didn't you have a timestamp in there?

> +
> +static inline void cpufreq_kick(void)
> +{
> +	cpufreq_update_util(ULONG_MAX, ULONG_MAX);
> +}
> +
>  #endif

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH 0/3] cpufreq: Replace timers with utilization update callbacks
  2016-02-10  1:02           ` Steve Muckle
  2016-02-10  1:57             ` Rafael J. Wysocki
@ 2016-02-11 11:59             ` Peter Zijlstra
  2016-02-11 12:24               ` Juri Lelli
  2016-02-11 17:06               ` Steve Muckle
  1 sibling, 2 replies; 134+ messages in thread
From: Peter Zijlstra @ 2016-02-11 11:59 UTC (permalink / raw)
  To: Steve Muckle
  Cc: Rafael J. Wysocki, Rafael J. Wysocki, Linux PM list,
	Linux Kernel Mailing List, Srinivas Pandruvada, Viresh Kumar,
	Juri Lelli, Thomas Gleixner

On Tue, Feb 09, 2016 at 05:02:33PM -0800, Steve Muckle wrote:
> > Index: linux-pm/kernel/sched/deadline.c
> > ===================================================================
> > --- linux-pm.orig/kernel/sched/deadline.c
> > +++ linux-pm/kernel/sched/deadline.c
> > @@ -1197,6 +1197,9 @@ static void task_tick_dl(struct rq *rq,
> >  {
> >  	update_curr_dl(rq);
> >  
> > +	/* Kick cpufreq to prevent it from stalling. */
> > +	cpufreq_kick();
> > +
> >  	/*
> >  	 * Even when we have runtime, update_curr_dl() might have resulted in us
> >  	 * not being the leftmost task anymore. In that case NEED_RESCHED will
> 
> I think additional hooks such as enqueue/dequeue would be needed in
> RT/DL. The task tick callbacks will only run if a task in that class is
> executing at the time of the tick. There could be intermittent RT/DL
> task activity in a frequency domain (the only task activity there, no
> CFS tasks) that doesn't happen to overlap the tick. Worst case the task
> activity could be periodic in such a way that it never overlaps the tick
> and the update is never made.

No, for RT (RR/FIFO) we do not have enough information to do anything
useful. Basically RR/FIFO should result in running 100% whenever we
schedule such a task.

That means RR/FIFO want a hook in pick_next_task_rt() to bump the freq
to 100% and leave it there until something else gets to run.

For DL it basically wants to set a minimum freq based on reserved
utilization, so that is __setparam_dl() or somewhere around there.

And we should either use CPPC hints for min freq or manually ensure that
the CFS callback will not select something less than this.
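
(Purely as an illustration of the shape; the helpers below are made up,
not an existing API:)

/*
 * In pick_next_task_rt(), something like
 *
 *      cpufreq_bump_to_max(cpu_of(rq));
 *
 * (a hypothetical helper), and in __setparam_dl() a frequency floor
 * derived from the accepted bandwidth (<<20 fixed point, cf. to_ratio()):
 */
static unsigned int dl_freq_floor(unsigned int max_freq, u64 dl_bw)
{
        return (unsigned int)((max_freq * dl_bw) >> 20);
}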

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH 0/3] cpufreq: Replace timers with utilization update callbacks
  2016-02-11 11:51           ` Peter Zijlstra
@ 2016-02-11 12:08             ` Rafael J. Wysocki
  2016-02-11 15:29               ` Peter Zijlstra
  2016-02-11 20:47               ` Rafael J. Wysocki
  0 siblings, 2 replies; 134+ messages in thread
From: Rafael J. Wysocki @ 2016-02-11 12:08 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Rafael J. Wysocki, Steve Muckle, Rafael J. Wysocki, Linux PM list,
	Linux Kernel Mailing List, Srinivas Pandruvada, Viresh Kumar,
	Juri Lelli, Thomas Gleixner

On Thu, Feb 11, 2016 at 12:51 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Tue, Feb 09, 2016 at 09:05:05PM +0100, Rafael J. Wysocki wrote:
>> > > One concern I had was, given that the lone scheduler update hook is in
>> > > CFS, is it possible for governor updates to be stalled due to RT or DL
>> > > task activity?
>> >
>> > I don't think they may be completely stalled, but I'd prefer Peter to
>> > answer that as he suggested to do it this way.
>>
>> In any case, if that concern turns out to be significant in practice, it may
>> be addressed like in the appended modification of patch [1/3] from the $subject
>> series.
>>
>> With that things look like before from the cpufreq side, but the other sched
>> classes also get a chance to trigger a cpufreq update.  The drawback is the
>> cpu_clock() call instead of passing the time value from update_load_avg(), but
>> I guess we can live with that if necessary.
>>
>> FWIW, this modification doesn't seem to break things on my test machine.
>
> Not really pretty though. It blows a bit that you require this callback
> to be periodic (in order to replace a timer).

We need it for now, but that's because of how things work on the cpufreq side.

> Ideally we'd not have to call this if state doesn't change.

When cpufreq starts to use the util numbers, things will work like
that pretty much automatically.

We'll need to avoid thrashing if there are too many state changes over
a short time, but that's a different problem.

>> +++ linux-pm/include/linux/sched.h
>> @@ -3207,4 +3207,11 @@ static inline unsigned long rlimit_max(u
>>       return task_rlimit_max(current, limit);
>>  }
>>
>> +void cpufreq_update_util(unsigned long util, unsigned long max);
>
> Didn't you have a timestamp in there?

I did and I still do in fact.

The last version is here:

https://patchwork.kernel.org/patch/8275271/

but it has the additional hooks for RT/DL which you seem to be
thinking are a mistake.

Thanks,
Rafael

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH 0/3] cpufreq: Replace timers with utilization update callbacks
  2016-02-11 11:59             ` Peter Zijlstra
@ 2016-02-11 12:24               ` Juri Lelli
  2016-02-11 15:26                 ` Peter Zijlstra
  2016-02-11 17:06               ` Steve Muckle
  1 sibling, 1 reply; 134+ messages in thread
From: Juri Lelli @ 2016-02-11 12:24 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Steve Muckle, Rafael J. Wysocki, Rafael J. Wysocki, Linux PM list,
	Linux Kernel Mailing List, Srinivas Pandruvada, Viresh Kumar,
	Thomas Gleixner

Hi Peter,

On 11/02/16 12:59, Peter Zijlstra wrote:
> On Tue, Feb 09, 2016 at 05:02:33PM -0800, Steve Muckle wrote:
> > > Index: linux-pm/kernel/sched/deadline.c
> > > ===================================================================
> > > --- linux-pm.orig/kernel/sched/deadline.c
> > > +++ linux-pm/kernel/sched/deadline.c
> > > @@ -1197,6 +1197,9 @@ static void task_tick_dl(struct rq *rq,
> > >  {
> > >  	update_curr_dl(rq);
> > >  
> > > +	/* Kick cpufreq to prevent it from stalling. */
> > > +	cpufreq_kick();
> > > +
> > >  	/*
> > >  	 * Even when we have runtime, update_curr_dl() might have resulted in us
> > >  	 * not being the leftmost task anymore. In that case NEED_RESCHED will
> > 
> > I think additional hooks such as enqueue/dequeue would be needed in
> > RT/DL. The task tick callbacks will only run if a task in that class is
> > executing at the time of the tick. There could be intermittent RT/DL
> > task activity in a frequency domain (the only task activity there, no
> > CFS tasks) that doesn't happen to overlap the tick. Worst case the task
> > activity could be periodic in such a way that it never overlaps the tick
> > and the update is never made.
> 
> No, for RT (RR/FIFO) we do not have enough information to do anything
> useful. Basically RR/FIFO should result in running 100% whenever we
> schedule such a task.
> 
> That means RR/FIFO want a hook in pick_next_task_rt() to bump the freq
> to 100% and leave it there until something else gets to run.
> 

Vincent is trying to play with rt_avg (in the last sched-freq thread) to
see if we can get some information about RT as well. I understand that,
from a theoretical perspective, there's not much we can say about such
tasks, and bumping to max can be the only sensible thing to do, but there
are users of RT (ehm, Android) that will probably see differences in
energy consumption if we do so. Yeah, maybe they should use a different
policy, yes.

> For DL it basically wants to set a minimum freq based on reserved
> utilization, so that is __setparam_dl() or somewhere around there.
> 

I think we could do better than this once Luca's reclaiming stuff gets
in. The reserved bw is usually somewhat pessimistic. But this is a
different discussion, maybe.

Best,

- Juri

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH 0/3] cpufreq: Replace timers with utilization update callbacks
  2016-02-11 12:24               ` Juri Lelli
@ 2016-02-11 15:26                 ` Peter Zijlstra
  2016-02-11 18:23                   ` Vincent Guittot
  0 siblings, 1 reply; 134+ messages in thread
From: Peter Zijlstra @ 2016-02-11 15:26 UTC (permalink / raw)
  To: Juri Lelli
  Cc: Steve Muckle, Rafael J. Wysocki, Rafael J. Wysocki, Linux PM list,
	Linux Kernel Mailing List, Srinivas Pandruvada, Viresh Kumar,
	Thomas Gleixner

On Thu, Feb 11, 2016 at 12:24:29PM +0000, Juri Lelli wrote:
> Hi Peter,
> 
> On 11/02/16 12:59, Peter Zijlstra wrote:
> > On Tue, Feb 09, 2016 at 05:02:33PM -0800, Steve Muckle wrote:
> > > > Index: linux-pm/kernel/sched/deadline.c
> > > > ===================================================================
> > > > --- linux-pm.orig/kernel/sched/deadline.c
> > > > +++ linux-pm/kernel/sched/deadline.c
> > > > @@ -1197,6 +1197,9 @@ static void task_tick_dl(struct rq *rq,
> > > >  {
> > > >  	update_curr_dl(rq);
> > > >  
> > > > +	/* Kick cpufreq to prevent it from stalling. */
> > > > +	cpufreq_kick();
> > > > +
> > > >  	/*
> > > >  	 * Even when we have runtime, update_curr_dl() might have resulted in us
> > > >  	 * not being the leftmost task anymore. In that case NEED_RESCHED will
> > > 
> > > I think additional hooks such as enqueue/dequeue would be needed in
> > > RT/DL. The task tick callbacks will only run if a task in that class is
> > > executing at the time of the tick. There could be intermittent RT/DL
> > > task activity in a frequency domain (the only task activity there, no
> > > CFS tasks) that doesn't happen to overlap the tick. Worst case the task
> > > activity could be periodic in such a way that it never overlaps the tick
> > > and the update is never made.
> > 
> > No, for RT (RR/FIFO) we do not have enough information to do anything
> > useful. Basically RR/FIFO should result in running 100% whenever we
> > schedule such a task.
> > 
> > That means RR/FIFO want a hook in pick_next_task_rt() to bump the freq
> > to 100% and leave it there until something else gets to run.
> > 
> 
> Vincent is trying to play with rt_avg (in the last sched-freq thread) to
> see if we can get some information about RT as well. I understand that,
> from a theoretical perspective, there's not much we can say about such
> tasks, and bumping to max can be the only sensible thing to do, but there
> are users of RT (ehm, Android) that will probably see differences in
> energy consumption if we do so. Yeah, maybe they should use a different
> policy, yes.

Can't we just let broken people get broken results? Trying to use
rt_avg for this is just insane. We should ensure that people using this
thing correctly get correct results; the rest can take a hike.

Using rt_avg gets us to the place where people who want to do the right
thing cannot, and that is bad.

> > For DL it basically wants to set a minimum freq based on reserved
> > utilization, so that is __setparam_dl() or somewhere around there.
> > 
> 
> I think we could do better than this once Luca's reclaiming stuff gets
> in. The reserved bw is usually somewhat pessimistic. But this is a
> different discussion, maybe.

Sure, there's cleverer things that can be done. But a simple one would
indeed be the min guarantee based on accepted bandwidth.

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH 0/3] cpufreq: Replace timers with utilization update callbacks
  2016-02-11 12:08             ` Rafael J. Wysocki
@ 2016-02-11 15:29               ` Peter Zijlstra
  2016-02-11 15:58                 ` Rafael J. Wysocki
  2016-02-11 20:47               ` Rafael J. Wysocki
  1 sibling, 1 reply; 134+ messages in thread
From: Peter Zijlstra @ 2016-02-11 15:29 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Rafael J. Wysocki, Steve Muckle, Linux PM list,
	Linux Kernel Mailing List, Srinivas Pandruvada, Viresh Kumar,
	Juri Lelli, Thomas Gleixner

On Thu, Feb 11, 2016 at 01:08:28PM +0100, Rafael J. Wysocki wrote:
> > Not really pretty though. It blows a bit that you require this callback
> > to be periodic (in order to replace a timer).
> 
> We need it for now, but that's because of how things work on the cpufreq side.

Right, maybe stick a big comment on cpufreq_trigger_update() noting it's
a big ugly hack that will go away 'soon'.

> The last version is here:
> 
> https://patchwork.kernel.org/patch/8275271/
> 
> but it has the additional hooks for RT/DL which you seem to be
> thinking are a mistake.

As long as we make sure everybody knows they're a band-aid and will be
taken out back and shot, that should be fine for a little while, I
suppose.

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH 0/3] cpufreq: Replace timers with utilization update callbacks
  2016-02-11 15:29               ` Peter Zijlstra
@ 2016-02-11 15:58                 ` Rafael J. Wysocki
  0 siblings, 0 replies; 134+ messages in thread
From: Rafael J. Wysocki @ 2016-02-11 15:58 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Rafael J. Wysocki, Rafael J. Wysocki, Steve Muckle, Linux PM list,
	Linux Kernel Mailing List, Srinivas Pandruvada, Viresh Kumar,
	Juri Lelli, Thomas Gleixner

On Thu, Feb 11, 2016 at 4:29 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Thu, Feb 11, 2016 at 01:08:28PM +0100, Rafael J. Wysocki wrote:
>> > Not really pretty though. It blows a bit that you require this callback
>> > to be periodic (in order to replace a timer).
>>
>> We need it for now, but that's because of how things work on the cpufreq side.
>
> Right, maybe stick a big comment on cpufreq_trigger_update() noting it's
> a big ugly hack that will go away 'soon'.

I will.

>> The last version is here:
>>
>> https://patchwork.kernel.org/patch/8275271/
>>
>> but it has the additional hooks for RT/DL which you seem to be
>> thinking are a mistake.
>
> As long as we make sure everybody knows they're a band-aid and will be
> taken out back and shot, that should be fine for a little while, I
> suppose.

Great, thanks!

Yes, I'm treating those as a band-aid to be replaced.

Let me update the patch with a comment to explain that.

Thanks,
Rafael

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH 0/3] cpufreq: Replace timers with utilization update callbacks
  2016-02-11 11:59             ` Peter Zijlstra
  2016-02-11 12:24               ` Juri Lelli
@ 2016-02-11 17:06               ` Steve Muckle
  2016-02-11 17:30                 ` Peter Zijlstra
  1 sibling, 1 reply; 134+ messages in thread
From: Steve Muckle @ 2016-02-11 17:06 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Rafael J. Wysocki, Rafael J. Wysocki, Linux PM list,
	Linux Kernel Mailing List, Srinivas Pandruvada, Viresh Kumar,
	Juri Lelli, Thomas Gleixner

Hi Peter,

On 02/11/2016 03:59 AM, Peter Zijlstra wrote:
>> I think additional hooks such as enqueue/dequeue would be needed in
>> > RT/DL. The task tick callbacks will only run if a task in that class is
>> > executing at the time of the tick. There could be intermittent RT/DL
>> > task activity in a frequency domain (the only task activity there, no
>> > CFS tasks) that doesn't happen to overlap the tick. Worst case the task
>> > activity could be periodic in such a way that it never overlaps the tick
>> > and the update is never made.
>
> No, for RT (RR/FIFO) we do not have enough information to do anything
> useful. Basically RR/FIFO should result in running 100% whenever we
> schedule such a task.
> 
> That means RR/FIFO want a hook in pick_next_task_rt() to bump the freq
> to 100% and leave it there until something else gets to run.
>
> For DL it basically wants to set a minimum freq based on reserved
> utilization, so that is __setparam_dl() or somewhere around there.
> 
> And we should either use CPPC hints for min freq or manually ensure that
> the CFS callback will not select something less than this.

Rafael's changes aren't specifying particular frequencies/capacities in
the scheduler hooks. They're just pokes to get cpufreq to run, in order
to eliminate cpufreq's timers.

My concern above is whether pokes are guaranteed to keep occurring when
there is only RT or DL activity, so that nothing breaks.

thanks,
Steve

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH 0/3] cpufreq: Replace timers with utilization update callbacks
  2016-02-11 17:06               ` Steve Muckle
@ 2016-02-11 17:30                 ` Peter Zijlstra
  2016-02-11 17:34                   ` Rafael J. Wysocki
  2016-02-11 18:52                   ` Steve Muckle
  0 siblings, 2 replies; 134+ messages in thread
From: Peter Zijlstra @ 2016-02-11 17:30 UTC (permalink / raw)
  To: Steve Muckle
  Cc: Rafael J. Wysocki, Rafael J. Wysocki, Linux PM list,
	Linux Kernel Mailing List, Srinivas Pandruvada, Viresh Kumar,
	Juri Lelli, Thomas Gleixner

On Thu, Feb 11, 2016 at 09:06:04AM -0800, Steve Muckle wrote:
> Hi Peter,
> 
> >> > I think additional hooks such as enqueue/dequeue would be needed in
> >> > RT/DL.

That is what I reacted to mostly. Enqueue/dequeue hooks don't really
make much sense for RT / DL.

> Rafael's changes aren't specifying particular frequencies/capacities in
> the scheduler hooks. They're just pokes to get cpufreq to run, in order
> to eliminate cpufreq's timers.
> 
> My concern above is whether pokes are guaranteed to keep occurring when
> there is only RT or DL activity, so that nothing breaks.

The hook in their respective tick handler should ensure stuff is called
sporadically and isn't stalled.

^ permalink raw reply	[flat|nested] 134+ messages in thread

* [PATCH v8 1/3] cpufreq: Add mechanism for registering utilization update callbacks
  2016-02-10 23:01     ` [PATCH v7 " Rafael J. Wysocki
@ 2016-02-11 17:30       ` Rafael J. Wysocki
  2016-02-12 13:16         ` [PATCH v9 " Rafael J. Wysocki
  0 siblings, 1 reply; 134+ messages in thread
From: Rafael J. Wysocki @ 2016-02-11 17:30 UTC (permalink / raw)
  To: Linux PM list, Peter Zijlstra
  Cc: Ingo Molnar, Linux Kernel Mailing List, Srinivas Pandruvada,
	Viresh Kumar, Juri Lelli, Steve Muckle, Thomas Gleixner

From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

Introduce a mechanism by which parts of the cpufreq subsystem
("setpolicy" drivers or the core) can register callbacks to be
executed from cpufreq_update_util() which is invoked by the
scheduler's update_load_avg() on CPU utilization changes.

This allows the "setpolicy" drivers to dispense with their timers
and do all of the computations they need and frequency/voltage
adjustments in the update_load_avg() code path, among other things.

The update_load_avg() changes were suggested by Peter Zijlstra.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
---

Changes from v7:
- cpufreq_trigger_update() now has a kerneldoc comment describing it as a
  band-aid to be replaced in the future, and the comments next to its call
  sites point the reader to that comment.

  No functional changes. 

Changes from v6:
- Steve suggested using rq_clock() instead of rq_clock_task() as the time
  argument for cpufreq_update_util(), as that seems to be more suitable for
  this purpose.

---
 drivers/cpufreq/cpufreq.c |   45 +++++++++++++++++++++++++++++++++++++++++++++
 include/linux/cpufreq.h   |   34 ++++++++++++++++++++++++++++++++++
 kernel/sched/deadline.c   |    3 +++
 kernel/sched/fair.c       |   26 +++++++++++++++++++++++++-
 kernel/sched/rt.c         |    3 +++
 kernel/sched/sched.h      |    1 +
 6 files changed, 111 insertions(+), 1 deletion(-)

Index: linux-pm/include/linux/cpufreq.h
===================================================================
--- linux-pm.orig/include/linux/cpufreq.h
+++ linux-pm/include/linux/cpufreq.h
@@ -151,6 +151,36 @@ static inline bool policy_is_shared(stru
 extern struct kobject *cpufreq_global_kobject;
 
 #ifdef CONFIG_CPU_FREQ
+void cpufreq_update_util(u64 time, unsigned long util, unsigned long max);
+
+/**
+ * cpufreq_trigger_update - Trigger CPU performance state evaluation if needed.
+ * @time: Current time.
+ *
+ * The way cpufreq is currently arranged requires it to evaluate the CPU
+ * performance state (frequency/voltage) on a regular basis to prevent it from
+ * being stuck in a completely inadequate performance level for too long.
+ * That is not guaranteed to happen if the updates are only triggered from CFS,
+ * though, because they may not be coming in if RT or deadline tasks are active
+ * all the time.
+ *
+ * As a workaround for that issue, this function is called by the RT and DL
+ * sched classes to trigger extra cpufreq updates to prevent it from stalling,
+ * but that really is a band-aid.  Going forward it should be replaced with
+ * solutions targeted more specifically at RT and DL tasks.
+ */
+static inline void cpufreq_trigger_update(u64 time)
+{
+	cpufreq_update_util(time, ULONG_MAX, 0);
+}
+
+struct update_util_data {
+	void (*func)(struct update_util_data *data,
+		     u64 time, unsigned long util, unsigned long max);
+};
+
+void cpufreq_set_update_util_data(int cpu, struct update_util_data *data);
+
 unsigned int cpufreq_get(unsigned int cpu);
 unsigned int cpufreq_quick_get(unsigned int cpu);
 unsigned int cpufreq_quick_get_max(unsigned int cpu);
@@ -162,6 +192,10 @@ int cpufreq_update_policy(unsigned int c
 bool have_governor_per_policy(void);
 struct kobject *get_governor_parent_kobj(struct cpufreq_policy *policy);
 #else
+static inline void cpufreq_update_util(u64 time, unsigned long util,
+				       unsigned long max) {}
+static inline void cpufreq_trigger_update(u64 time) {}
+
 static inline unsigned int cpufreq_get(unsigned int cpu)
 {
 	return 0;
Index: linux-pm/kernel/sched/sched.h
===================================================================
--- linux-pm.orig/kernel/sched/sched.h
+++ linux-pm/kernel/sched/sched.h
@@ -9,6 +9,7 @@
 #include <linux/irq_work.h>
 #include <linux/tick.h>
 #include <linux/slab.h>
+#include <linux/cpufreq.h>
 
 #include "cpupri.h"
 #include "cpudeadline.h"
Index: linux-pm/kernel/sched/fair.c
===================================================================
--- linux-pm.orig/kernel/sched/fair.c
+++ linux-pm/kernel/sched/fair.c
@@ -2824,7 +2824,8 @@ static inline void update_load_avg(struc
 {
 	struct cfs_rq *cfs_rq = cfs_rq_of(se);
 	u64 now = cfs_rq_clock_task(cfs_rq);
-	int cpu = cpu_of(rq_of(cfs_rq));
+	struct rq *rq = rq_of(cfs_rq);
+	int cpu = cpu_of(rq);
 
 	/*
 	 * Track task load average for carrying it to new CPU after migrated, and
@@ -2836,6 +2837,29 @@ static inline void update_load_avg(struc
 
 	if (update_cfs_rq_load_avg(now, cfs_rq) && update_tg)
 		update_tg_load_avg(cfs_rq, 0);
+
+	if (cpu == smp_processor_id() && &rq->cfs == cfs_rq) {
+		unsigned long max = rq->cpu_capacity_orig;
+
+		/*
+		 * There are a few boundary cases this might miss but it should
+		 * get called often enough that that should (hopefully) not be
+		 * a real problem -- added to that it only calls on the local
+		 * CPU, so if we enqueue remotely we'll miss an update, but
+		 * the next tick/schedule should update.
+		 *
+		 * It will not get called when we go idle, because the idle
+		 * thread is a different class (!fair), nor will the utilization
+		 * number include things like RT tasks.
+		 *
+		 * As is, the util number is not freq-invariant (we'd have to
+		 * implement arch_scale_freq_capacity() for that).
+		 *
+		 * See cpu_util().
+		 */
+		cpufreq_update_util(rq_clock(rq),
+				    min(cfs_rq->avg.util_avg, max), max);
+	}
 }
 
 static void attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
Index: linux-pm/kernel/sched/deadline.c
===================================================================
--- linux-pm.orig/kernel/sched/deadline.c
+++ linux-pm/kernel/sched/deadline.c
@@ -726,6 +726,9 @@ static void update_curr_dl(struct rq *rq
 	if (!dl_task(curr) || !on_dl_rq(dl_se))
 		return;
 
+	/* Kick cpufreq (see the comment in linux/cpufreq.h). */
+	cpufreq_trigger_update(rq_clock(rq));
+
 	/*
 	 * Consumed budget is computed considering the time as
 	 * observed by schedulable tasks (excluding time spent
Index: linux-pm/kernel/sched/rt.c
===================================================================
--- linux-pm.orig/kernel/sched/rt.c
+++ linux-pm/kernel/sched/rt.c
@@ -949,6 +949,9 @@ static void update_curr_rt(struct rq *rq
 	if (unlikely((s64)delta_exec <= 0))
 		return;
 
+	/* Kick cpufreq (see the comment in linux/cpufreq.h). */
+	cpufreq_trigger_update(rq_clock(rq));
+
 	schedstat_set(curr->se.statistics.exec_max,
 		      max(curr->se.statistics.exec_max, delta_exec));
 
Index: linux-pm/drivers/cpufreq/cpufreq.c
===================================================================
--- linux-pm.orig/drivers/cpufreq/cpufreq.c
+++ linux-pm/drivers/cpufreq/cpufreq.c
@@ -102,6 +102,51 @@ static LIST_HEAD(cpufreq_governor_list);
 static struct cpufreq_driver *cpufreq_driver;
 static DEFINE_PER_CPU(struct cpufreq_policy *, cpufreq_cpu_data);
 static DEFINE_RWLOCK(cpufreq_driver_lock);
+
+static DEFINE_PER_CPU(struct update_util_data *, cpufreq_update_util_data);
+
+/**
+ * cpufreq_set_update_util_data - Populate the CPU's update_util_data pointer.
+ * @cpu: The CPU to set the pointer for.
+ * @data: New pointer value.
+ *
+ * Set and publish the update_util_data pointer for the given CPU.  That pointer
+ * points to a struct update_util_data object containing a callback function
+ * to call from cpufreq_update_util().  That function will be called from an RCU
+ * read-side critical section, so it must not sleep.
+ *
+ * Callers must use RCU callbacks to free any memory that might be accessed
+ * via the old update_util_data pointer or invoke synchronize_rcu() right after
+ * this function to avoid use-after-free.
+ */
+void cpufreq_set_update_util_data(int cpu, struct update_util_data *data)
+{
+	rcu_assign_pointer(per_cpu(cpufreq_update_util_data, cpu), data);
+}
+EXPORT_SYMBOL_GPL(cpufreq_set_update_util_data);
+
+/**
+ * cpufreq_update_util - Take a note about CPU utilization changes.
+ * @time: Current time.
+ * @util: Current utilization.
+ * @max: Utilization ceiling.
+ *
+ * This function is called by the scheduler on every invocation of
+ * update_load_avg() on the CPU whose utilization is being updated.
+ */
+void cpufreq_update_util(u64 time, unsigned long util, unsigned long max)
+{
+	struct update_util_data *data;
+
+	rcu_read_lock();
+
+	data = rcu_dereference(*this_cpu_ptr(&cpufreq_update_util_data));
+	if (data && data->func)
+		data->func(data, time, util, max);
+
+	rcu_read_unlock();
+}
+
 DEFINE_MUTEX(cpufreq_governor_lock);
 
 /* Flag to suspend/resume CPUFreq governors */

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH 0/3] cpufreq: Replace timers with utilization update callbacks
  2016-02-11 17:30                 ` Peter Zijlstra
@ 2016-02-11 17:34                   ` Rafael J. Wysocki
  2016-02-11 17:38                     ` Peter Zijlstra
  2016-02-11 18:52                   ` Steve Muckle
  1 sibling, 1 reply; 134+ messages in thread
From: Rafael J. Wysocki @ 2016-02-11 17:34 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Steve Muckle, Rafael J. Wysocki, Rafael J. Wysocki, Linux PM list,
	Linux Kernel Mailing List, Srinivas Pandruvada, Viresh Kumar,
	Juri Lelli, Thomas Gleixner

On Thu, Feb 11, 2016 at 6:30 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Thu, Feb 11, 2016 at 09:06:04AM -0800, Steve Muckle wrote:
>> Hi Peter,
>>
>> >> > I think additional hooks such as enqueue/dequeue would be needed in
>> >> > RT/DL.
>
> That is what I reacted to mostly. Enqueue/dequeue hooks don't really
> make much sense for RT / DL.
>
>> Rafael's changes aren't specifying particular frequencies/capacities in
>> the scheduler hooks. They're just pokes to get cpufreq to run, in order
>> to eliminate cpufreq's timers.
>>
>> My concern above is that pokes are guaranteed to keep occurring when
>> there is only RT or DL activity so nothing breaks.
>
> The hook in their respective tick handler should ensure stuff is called
> sporadically and isn't stalled.

I've updated the patch in the meantime
(https://patchwork.kernel.org/patch/8283431/).

Should I move the RT/DL hooks to task_tick_rt/dl(), respectively?

Thanks,
Rafael

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH 0/3] cpufreq: Replace timers with utilization update callbacks
  2016-02-11 17:34                   ` Rafael J. Wysocki
@ 2016-02-11 17:38                     ` Peter Zijlstra
  0 siblings, 0 replies; 134+ messages in thread
From: Peter Zijlstra @ 2016-02-11 17:38 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Steve Muckle, Rafael J. Wysocki, Linux PM list,
	Linux Kernel Mailing List, Srinivas Pandruvada, Viresh Kumar,
	Juri Lelli, Thomas Gleixner

On Thu, Feb 11, 2016 at 06:34:05PM +0100, Rafael J. Wysocki wrote:
> I've updated the patch in the meantime
> (https://patchwork.kernel.org/patch/8283431/).
> 
> Should I move the RT/DL hooks to task_tick_rt/dl(), respectively?

Probably, this really is about kicking cpufreq to do something, right?
update_curr_*() seems overkill for that.

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH 0/3] cpufreq: Replace timers with utilization update callbacks
  2016-02-11 15:26                 ` Peter Zijlstra
@ 2016-02-11 18:23                   ` Vincent Guittot
  2016-02-12 14:04                     ` Peter Zijlstra
  0 siblings, 1 reply; 134+ messages in thread
From: Vincent Guittot @ 2016-02-11 18:23 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Juri Lelli, Steve Muckle, Rafael J. Wysocki, Rafael J. Wysocki,
	Linux PM list, Linux Kernel Mailing List, Srinivas Pandruvada,
	Viresh Kumar, Thomas Gleixner

On 11 February 2016 at 16:26, Peter Zijlstra <peterz@infradead.org> wrote:
> On Thu, Feb 11, 2016 at 12:24:29PM +0000, Juri Lelli wrote:
>> Hi Peter,
>>
>> On 11/02/16 12:59, Peter Zijlstra wrote:
>> >
>> > No, for RT (RR/FIFO) we do not have enough information to do anything
>> > useful. Basically RR/FIFO should result in running 100% whenever we
>> > schedule such a task.
>> >
>> > That means RR/FIFO want a hook in pick_next_task_rt() to bump the freq
>> > to 100% and leave it there until something else gets to run.
>> >
>>
>> Vincent is trying to play with rt_avg (in the last sched-freq thread) to
>> see if we can get some information about RT as well. I understand that
>> from a theoretical perspective that's not much we can say of such tasks,
>> and bumping to max can be the only sensible thing to do, but there are
>> users of RT (ehm, Android) that will probably see differences in energy
>> consumption if we do so. Yeah, maybe the should use a different policy,
>> yes.
>
> Can't we just let broken people get broken results? Trying to use
> rt_avg for this is just insane. We should ensure that people using this
> thing correctly get correct results; the rest can take a hike.
>
> Using rt_avg gets us to the place where people who want to do the right
> thing cannot, and that is bad.

I agree that using rt_avg is not the best choice for evaluating the
capacity used by RT tasks, but it has the advantage of already being
there. Do you mean that we should use another way to compute the
capacity used by RT tasks and then select the frequency from that?
Or do you mean that we can't do anything else than ask for the max
frequency?

Trying to set the max frequency just before scheduling an RT task is
not really doable on a lot of platforms, because the sequence that
changes the frequency can sleep and takes more time than the run time
of the task. In the end, we will have set the max frequency only once
the task has finished running. There is no solution other than raising
the min_freq of cpufreq to a level that ensures enough compute capacity
for RT tasks whose constraints are so tight that cpufreq can't react.
For other RT tasks, we can probably find a way to set a frequency that
fits both the RT constraints and power consumption.

>
>> > For DL it basically wants to set a minimum freq based on reserved
>> > utilization, so that is __setparam_dl() or somewhere around there.
>> >
>>
>> I think we could do better than this once Luca's reclaiming stuff gets
>> in. The reserved bw is usually somewhat pessimistic. But this is a
>> different discussion, maybe.
>
> Sure, there's cleverer things that can be done. But a simple one would
> indeed be the min guarantee based on accepted bandwidth.

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH 0/3] cpufreq: Replace timers with utilization update callbacks
  2016-02-11 17:30                 ` Peter Zijlstra
  2016-02-11 17:34                   ` Rafael J. Wysocki
@ 2016-02-11 18:52                   ` Steve Muckle
  2016-02-11 19:04                     ` Rafael J. Wysocki
  2016-02-12 14:10                     ` Peter Zijlstra
  1 sibling, 2 replies; 134+ messages in thread
From: Steve Muckle @ 2016-02-11 18:52 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Rafael J. Wysocki, Rafael J. Wysocki, Linux PM list,
	Linux Kernel Mailing List, Srinivas Pandruvada, Viresh Kumar,
	Juri Lelli, Thomas Gleixner

On 02/11/2016 09:30 AM, Peter Zijlstra wrote:
>> My concern above is whether pokes are guaranteed to keep occurring when
>> there is only RT or DL activity, so that nothing breaks.
>
> The hook in their respective tick handler should ensure stuff is called
> sporadically and isn't stalled.

But that's only true if the RT/DL tasks happen to be running when the
tick arrives right?

Couldn't we have RT/DL activity which doesn't overlap with the tick? And
if no CFS tasks happen to be executing on that CPU, we'll never trigger
the cpufreq update. This could go on for an arbitrarily long time
depending on the periodicity of the work.

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH 0/3] cpufreq: Replace timers with utilization update callbacks
  2016-02-11 18:52                   ` Steve Muckle
@ 2016-02-11 19:04                     ` Rafael J. Wysocki
  2016-02-12 13:43                       ` Rafael J. Wysocki
  2016-02-12 14:10                     ` Peter Zijlstra
  1 sibling, 1 reply; 134+ messages in thread
From: Rafael J. Wysocki @ 2016-02-11 19:04 UTC (permalink / raw)
  To: Steve Muckle, Peter Zijlstra
  Cc: Rafael J. Wysocki, Rafael J. Wysocki, Linux PM list,
	Linux Kernel Mailing List, Srinivas Pandruvada, Viresh Kumar,
	Juri Lelli, Thomas Gleixner

On Thu, Feb 11, 2016 at 7:52 PM, Steve Muckle <steve.muckle@linaro.org> wrote:
> On 02/11/2016 09:30 AM, Peter Zijlstra wrote:
>>> My concern above is whether pokes are guaranteed to keep occurring when
>>> there is only RT or DL activity, so that nothing breaks.
>>
>> The hook in their respective tick handler should ensure stuff is called
>> sporadically and isn't stalled.
>
> But that's only true if the RT/DL tasks happen to be running when the
> tick arrives right?
>
> Couldn't we have RT/DL activity which doesn't overlap with the tick? And
> if no CFS tasks happen to be executing on that CPU, we'll never trigger
> the cpufreq update. This could go on for an arbitrarily long time
> depending on the periodicity of the work.

I'm thinking that two additional hooks in enqueue_task_rt/dl() might
help here.  Then, we will hit either the tick or enqueue and that
should do the trick.
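
Something like this (a sketch against 4.5, untested; dl.c would get the
analogous two lines):

static void
enqueue_task_rt(struct rq *rq, struct task_struct *p, int flags)
{
        struct sched_rt_entity *rt_se = &p->rt;

        /* Kick cpufreq (see the comment in linux/cpufreq.h). */
        cpufreq_trigger_update(rq_clock(rq));

        if (flags & ENQUEUE_WAKEUP)
                rt_se->timeout = 0;

        /* ... rest of enqueue_task_rt() unchanged ... */
}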

Peter, what do you think?

Rafael

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH 0/3] cpufreq: Replace timers with utilization update callbacks
  2016-02-11 12:08             ` Rafael J. Wysocki
  2016-02-11 15:29               ` Peter Zijlstra
@ 2016-02-11 20:47               ` Rafael J. Wysocki
  1 sibling, 0 replies; 134+ messages in thread
From: Rafael J. Wysocki @ 2016-02-11 20:47 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Rafael J. Wysocki, Steve Muckle, Linux PM list,
	Linux Kernel Mailing List, Srinivas Pandruvada, Viresh Kumar,
	Juri Lelli, Thomas Gleixner, Doug Smythies

On Thu, Feb 11, 2016 at 1:08 PM, Rafael J. Wysocki <rafael@kernel.org> wrote:
> On Thu, Feb 11, 2016 at 12:51 PM, Peter Zijlstra <peterz@infradead.org> wrote:
>> On Tue, Feb 09, 2016 at 09:05:05PM +0100, Rafael J. Wysocki wrote:
>>> > > One concern I had was, given that the lone scheduler update hook is in
>>> > > CFS, is it possible for governor updates to be stalled due to RT or DL
>>> > > task activity?
>>> >
>>> > I don't think they may be completely stalled, but I'd prefer Peter to
>>> > answer that as he suggested to do it this way.
>>>
>>> In any case, if that concern turns out to be significant in practice, it may
>>> be addressed like in the appended modification of patch [1/3] from the $subject
>>> series.
>>>
>>> With that things look like before from the cpufreq side, but the other sched
>>> classes also get a chance to trigger a cpufreq update.  The drawback is the
>>> cpu_clock() call instead of passing the time value from update_load_avg(), but
>>> I guess we can live with that if necessary.
>>>
>>> FWIW, this modification doesn't seem to break things on my test machine.
>>
>> Not really pretty though. It blows a bit that you require this callback
>> to be periodic (in order to replace a timer).
>
> We need it for now, but that's because of how things work on the cpufreq side.

In fact, I don't need the new callback to be invoked periodically.  I
only need it to be called often enough, where "enough" means at least
once in every sampling interval (for lack of a better name), roughly on
average.  Less often than that may be kind of OK too, depending on the
case.

I guess I need to explain that in more detail, though, at least for
the record if not anything else, so let me do that.

To start with, let me note that things in cpufreq don't happen
periodically even today with timers, because all of those timers are
deferrable, so realistically you never know when you'll get the next
update.  We try to compensate for that in a kind of poor man's way
(which may be a source of problems by itself, as mentioned by Doug),
but that's rather a band-aid.

With that in mind, there are two cases, the intel_pstate case and the
ondemand/conservative governor case.

intel_pstate is simpler, because it can do everything it needs in the
new callback (or in a timer function previously).  Periodicity might
matter to it, but it only uses two last points in its computations,
the current one and the previous one.  Thus it is not that important
how long the particular interval is.  Of course, if it is way too
long, we may miss some intermediate peaks and valleys and if the peaks
are intermittent enough, people may see poor performance.  In
practice, though, it turns out that the new callback is invoked (even
from CFS alone) much more frequently than we need on the average, so
we apply a "sample delay" rate limit to it.
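
For illustration, the shape of the callback with that rate limit applied
(simplified from the intel_pstate patch in this series; the HWP special
case is omitted):

static void intel_pstate_update_util(struct update_util_data *data, u64 time,
                                     unsigned long util, unsigned long max)
{
        struct cpudata *cpu = container_of(data, struct cpudata, update_util);
        u64 delta_ns = time - cpu->sample.time;

        /* Sample delay: do nothing until enough time has passed. */
        if ((s64)delta_ns < pid_params.sample_rate_ns)
                return;

        intel_pstate_sample(cpu, time);
        intel_pstate_adjust_busy_pstate(cpu);
}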

In turn, the ondemand/conservative governor case is outright
ridiculous, because they don't even compute anything in the callback
(or a timer function previously).  They simply use it to spawn a work
item in process context that will estimate the "utilization" and
possibly change the P-state.  That may be delayed by the scheduling
interval, then pushed back by RT tasks and so on, so the time between
the moment they decide to take a "sample" and the moment that actually
happens may be, well, arbitrary.  So really timers are used here to
poke at things on a regular basis rather than for any actually
periodic stuff.
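
Schematically, the governor callback only checks whether it is time to take
a sample and, if so, queues an irq_work that schedules the governor work
item (a simplification of what patch [3/3] does; dbs_time_to_sample()
stands in for the actual bookkeeping):

static void dbs_update_util(struct update_util_data *data, u64 time,
                            unsigned long util, unsigned long max)
{
        struct cpu_dbs_info *cdbs = container_of(data, struct cpu_dbs_info,
                                                 update_util);

        if (dbs_time_to_sample(cdbs, time))
                irq_work_queue(&cdbs->irq_work);  /* no wake_up_process() here */
}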

That may be improved in two ways in principle.  First, by moving as
much as we can into the utilization update callback without adding too
much overhead to the scheduler path.  Governor computations are the
primary candidate for that.  They need to take all of the tunables
accessible from user space into account, but that shouldn't be a big
problem.  We may be able to call at least some drivers from there too
(even the ACPI driver may be able to switch P-states via register
writes in some cases).  The second way would be to use the utilization
numbers provided by the scheduler for making governor decisions.
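
As a trivial example of the latter (an illustration only, not something in
the posted patches), the governor could derive the next frequency directly
from the numbers passed to the callback:

static unsigned int freq_for_util(struct cpufreq_policy *policy,
                                  unsigned long util, unsigned long max)
{
        unsigned int f;

        /* cpufreq_trigger_update() passes util == ULONG_MAX, max == 0. */
        if (util >= max)
                return policy->max;

        f = policy->max * util / max;
        f += f >> 2;            /* ~25% headroom over measured utilization */
        return clamp(f, policy->min, policy->max);
}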

If we can do both, we should be much better off than we are today
already, even without the EAS stuff.

Thanks,
Rafael

^ permalink raw reply	[flat|nested] 134+ messages in thread

* RE: [PATCH v6 0/3] cpufreq: Replace timers with utilization update callbacks
  2016-02-10 23:17     ` Rafael J. Wysocki
@ 2016-02-11 22:50       ` Doug Smythies
  2016-02-11 23:28         ` Rafael J. Wysocki
  2016-02-12  7:25         ` Doug Smythies
  0 siblings, 2 replies; 134+ messages in thread
From: Doug Smythies @ 2016-02-11 22:50 UTC (permalink / raw)
  To: 'Rafael J. Wysocki', 'Srinivas Pandruvada'
  Cc: 'Linux PM list', 'Ingo Molnar',
	'Linux Kernel Mailing List', 'Peter Zijlstra',
	'Viresh Kumar', 'Juri Lelli',
	'Steve Muckle', 'Thomas Gleixner'

On 2016.02.10 15:18 Rafael J. Wysocki wrote:
> On Wednesday, February 10, 2016 03:11:43 PM Doug Smythies wrote:
>> On 2016.02.10 07:17 Rafael J. Wysocki wrote:
>>> On Friday, January 29, 2016 11:52:15 PM Rafael J. Wysocki wrote:
>>>> 
>> This patch set solves a long-standing issue with the intel_pstate driver.

> Good to hear that, thanks!

>> The issue began with the introduction of the "duration" method for deciding
>> if the CPU had been idle for a long time resulting in forcing the
>> target pstate downwards. Often this was the correct action, but sometimes this
>> was the wrong thing to do, because the cpu was actually very busy, but just so
>> happened to be idle on jiffy boundaries (perhaps similar to what Steve Muckle
>> was referring to on another branch of this thread).

>> I have a bunch of graphs, if anyone wants to see the supporting data.

> It would be good to see how the data with and without the patchset compare
> to each other if you have that.

Please see:
double u double u double u dot smythies dot com /~doug/linux/intel_pstate/rjw_patch_set/index.html

Specific duration test graphs are posted, along with a bunch of idle test graphs.
The references section includes links to all raw and post-processed data.

Note that on my 2 hour idle tests, I had a few 300 second durations
on CPU 6 with the v5 patch set.
(likely what Steve Muckle was referring to.)
Such long durations did not occur in v6 or v7 2 hour idle tests.

Very interesting patterns in the 2 hour idle test durations for
individual CPUs.

On 2016.02.10 22:03 Srinivas Pandruvada wrote:

>> My test computer has an older model i7 (Intel(R) Core(TM) i7-2600K CPU @ 3.40GHz)
> Thanks Doug. If you have specific workloads, please compare performance.

My work so far has been testing functionality, with unrealistic workloads specifically
designed to exaggerate issues, in this case the duration problem.

I'll look at some real world workload scenarios.

What I do have from my 2 hour idle tests is the total number of passes through
the intel_pstate driver:

Control sample: Kernel 4.3-rc3: 37949 passes.
Kernel 4.3-rc3 + rjw 3 patch set v5: 180355 passes
Kernel 4.3-rc3 + rjw 3 patch set v6: 201307 passes
Kernel 4.3-rc3 + rjw 3 patch set v7: 203619 passes

While I should have, I did not run turbostat to get idle energy and/or power.
However, a 1 hour idle test with turbostat gave (Package Joules):
Control sample: Kernel 4.3-rc3: 13788 J or 3.83 Watts
Kernel 4.3-rc3 + rjw 3 patch set v7: 13929 J or 3.87 Watts

... Doug

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH v6 0/3] cpufreq: Replace timers with utilization update callbacks
  2016-02-11 22:50       ` Doug Smythies
@ 2016-02-11 23:28         ` Rafael J. Wysocki
  2016-02-12  1:02           ` Doug Smythies
  2016-02-12  7:25         ` Doug Smythies
  1 sibling, 1 reply; 134+ messages in thread
From: Rafael J. Wysocki @ 2016-02-11 23:28 UTC (permalink / raw)
  To: Doug Smythies
  Cc: Rafael J. Wysocki, Srinivas Pandruvada, Linux PM list,
	Ingo Molnar, Linux Kernel Mailing List, Peter Zijlstra,
	Viresh Kumar, Juri Lelli, Steve Muckle, Thomas Gleixner

Hi Doug,

On Thu, Feb 11, 2016 at 11:50 PM, Doug Smythies <dsmythies@telus.net> wrote:
> On 2016.02.10 15:18 Rafael J. Wysocki wrote:
>> On Wednesday, February 10, 2016 03:11:43 PM Doug Smythies wrote:
>>> On 2016.02.10 07:17 Rafael J. Wysocki wrote:
>>>> On Friday, January 29, 2016 11:52:15 PM Rafael J. Wysocki wrote:
>>>>>
>>> This patch set solves a long standing issue with the intel_pstate driver.
>
>> Good to hear that, thanks!
>
>>> The issue began with the introduction of the "duration" method for deciding
>>> if the CPU had been idle for a long time resulting in forcing the
>>> target pstate downwards. Often this was the correct action, but sometimes this
>>> was the wrong thing to do, because the cpu was actually very busy, but just so
>>> happened to be idle on jiffy boundaries (perhaps similar to what Steve Muckle
>>> was referring to on another branch of this thread).
>
>>> I have a bunch of graphs, if anyone wants to see the supporting data.
>
>> It would be good to see how the data with and without the patchset compare
>> to each other if you have that.
>
> Please see:
> www.smythies.com/~doug/linux/intel_pstate/rjw_patch_set/index.html

Thanks for the data.

> Specific duration test graphs are posted, and also a bunch of idle test graphs are posted.
> The references section includes links to all raw and post-processed data.
>
> Note that on my 2 hour idle tests, I had a few 300 second durations
> on CPU 6 with the v5 patch set.
> (likely what Steve Muckle was referring to.)
> Such long durations did not occur in v6 or v7 2 hour idle tests.

OK, that suggests that using rq_lock(rq) in patch [1/3] is a win.

> Very interesting patterns in the 2 hour idle test durations for
> individual CPUs.
>
> On 2016.02.10 22:03 Srinivas Pandruvada wrote:
>
>>> My test computer has an older model i7 (Intel(R) Core(TM) i7-2600K CPU @ 3.40GHz)
>> Thanks Doug. If you have specific workloads, please compare performance.
>
> My work so far has been testing functionality, with unrealistic workloads specifically
> designed to exaggerate issues, in this case the duration problem.
>
> I'll look at some real world workload scenarios.
>
> What I do have from my 2 hour idle tests is the total number of passes through
> the intel_pstate driver:
>
> Control sample: Kernel 4.3-rc3: 37949 passes.
> Kernel 4.3-rc3 + rjw 3 patch set v5: 180355 passes
> Kernel 4.3-rc3 + rjw 3 patch set v6: 201307 passes
> Kernel 4.3-rc3 + rjw 3 patch set v7: 203619 passes

That reflects how things work with the changes.  The driver is called
more often now and has to decide whether or not to take a sample.

It would be interesting to see how many of those were samples that
were actually taken if you can instrument that.

> While I should have, I did not run turbostat to get idle energy and/or power.
> However, a 1 hour idle test with turbostat gave (Package Joules):
> Control sample: Kernel 4.3-rc3: 13788 J or 3.83 Watts
> Kernel 4.3-rc3 + rjw 3 patch set v7: 13929 J or 3.87 Watts

So it shows a slight increase in energy consumption with your
workloads.  It is not enough to make me worry in any way, but I'm
wondering if performance is better too as a result (and how much
better if so).

Thanks,
Rafael

^ permalink raw reply	[flat|nested] 134+ messages in thread

* RE: [PATCH v6 0/3] cpufreq: Replace timers with utilization update callbacks
  2016-02-11 23:28         ` Rafael J. Wysocki
@ 2016-02-12  1:02           ` Doug Smythies
  2016-02-12  1:20             ` Rafael J. Wysocki
  0 siblings, 1 reply; 134+ messages in thread
From: Doug Smythies @ 2016-02-12  1:02 UTC (permalink / raw)
  To: 'Rafael J. Wysocki'
  Cc: 'Rafael J. Wysocki', 'Srinivas Pandruvada',
	'Linux PM list', 'Ingo Molnar',
	'Linux Kernel Mailing List', 'Peter Zijlstra',
	'Viresh Kumar', 'Juri Lelli',
	'Steve Muckle', 'Thomas Gleixner'

On 2016.02.11 15:28 Rafael J. Wysocki wrote:
> On 2016.02.11 14:50 Doug Smythies wrote:

>> What I do have from my 2 hour idle tests is the total number of passes through
>> the intel_pstate driver:
>>
>> Control sample: Kernel 4.3-rc3: 37949 passes.
>> Kernel 4.3-rc3 + rjw 3 patch set v5: 180355 passes
>> Kernel 4.3-rc3 + rjw 3 patch set v6: 201307 passes
>> Kernel 4.3-rc3 + rjw 3 patch set v7: 203619 passes

> That reflects how things work with the changes.  The driver is called
> more often now and has to decide whether or not to take a sample.

Oops. I didn't understand that point, and so only now looked more
closely at the code.

> It would be interesting to see how many of those were samples that
> were actually taken if you can instrument that.

So, those are samples that were taken. There is no trace information
acquired when the new code decides not to take a sample (or so is my
understanding from a quick look).

I did find a couple of cases where the duration (elapsed time between
samples on a given CPU) was less than the nominal sample time. The search
was not exhaustive. (Likely O.K., within expected jitter; just noting it,
is all. The post-processing tools use the kernel clock to do the
calculation, as the duration calculated by the driver is not in the trace
data.)

2 hour idle test: v5 patch 9.955 mSec sample 10078 CPU 1
2 hour idle test: v7 patch 9.968 mSec sample 49476 CPU 3
Duration load test: v7 patch 9.982 mSec sample 10997 CPU 2

... Doug

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH v6 0/3] cpufreq: Replace timers with utilization update callbacks
  2016-02-12  1:02           ` Doug Smythies
@ 2016-02-12  1:20             ` Rafael J. Wysocki
  0 siblings, 0 replies; 134+ messages in thread
From: Rafael J. Wysocki @ 2016-02-12  1:20 UTC (permalink / raw)
  To: Doug Smythies
  Cc: Rafael J. Wysocki, Rafael J. Wysocki, Srinivas Pandruvada,
	Linux PM list, Ingo Molnar, Linux Kernel Mailing List,
	Peter Zijlstra, Viresh Kumar, Juri Lelli, Steve Muckle,
	Thomas Gleixner

On Fri, Feb 12, 2016 at 2:02 AM, Doug Smythies <dsmythies@telus.net> wrote:
> On 2016.02.11 15:28 Rafael J. Wysocki wrote:
>> On 2016.02.11 14:50 Doug Smythies wrote:
>
>>> What I do have from my 2 hour idle tests is the total number of passes through
>>> the intel_pstate driver:
>>>
>>> Control sample: Kernel 4.3-rc3: 37949 passes.
>>> Kernel 4.3-rc3 + rjw 3 patch set v5: 180355 passes
>>> Kernel 4.3-rc3 + rjw 3 patch set v6: 201307 passes
>>> Kernel 4.3-rc3 + rjw 3 patch set v7: 203619 passes
>
>> That reflects how things work with the changes.  The driver is called
>> more often now and has to decide whether or not to take a sample.
>
> Oops. I didn't understand that point, and so only now looked more
> closely at the code.
>
>> It would be interesting to see how many of those were samples that
>> were actually taken if you can instrument that.
>
> So, those are samples that were taken. There is no trace information
> acquired when the new code decides not to take a sample (or so is my
> understanding from a quick look).

That's correct.  The trace only covers the samples that were actually taken.

> I did find a couple of cases where the duration (elapsed time between
> samples on a given CPU) was less than the nominal sample time. The search
> was not exhaustive. (Likely O.K. within expected jitter, just noting
> is all. The post processing tools use the kernel clock to do the
> calculation, as the duration calculated by the driver is not in the trace
> data.)
>
> 2 hour idle test: v5 patch 9.955 mSec sample 10078 CPU 1
> 2 hour idle test: v7 patch 9.968 mSec sample 49476 CPU 3
> Duration load test: v7 patch 9.982 mSec sample 10997 CPU 2

OK, so the order of magnitude looks reasonable at least.

Thanks,
Rafael

^ permalink raw reply	[flat|nested] 134+ messages in thread

* RE: [PATCH v6 0/3] cpufreq: Replace timers with utilization update callbacks
  2016-02-11 22:50       ` Doug Smythies
  2016-02-11 23:28         ` Rafael J. Wysocki
@ 2016-02-12  7:25         ` Doug Smythies
  2016-02-12 13:39           ` Rafael J. Wysocki
  1 sibling, 1 reply; 134+ messages in thread
From: Doug Smythies @ 2016-02-12  7:25 UTC (permalink / raw)
  To: 'Rafael J. Wysocki', 'Srinivas Pandruvada'
  Cc: 'Linux PM list', 'Ingo Molnar',
	'Linux Kernel Mailing List', 'Peter Zijlstra',
	'Viresh Kumar', 'Juri Lelli',
	'Steve Muckle', 'Thomas Gleixner'

On 2016.02.11 14:50 Doug Smythies wrote:
> On 2016.02.10 22:03 Srinivas Pandruvada wrote:
>> On Wednesday, February 10, 2016 03:11:43 PM Doug Smythies wrote:

>>> My test computer has an older model i7 (Intel(R) Core(TM) i7-2600K CPU @ 3.40GHz)
>> Thanks Doug. If you have specific workloads, please compare performance.

> My work so far has been testing functionality, with unrealistic workloads specifically
> designed to exaggerate issues, in this case the duration problem.
>
> I'll look at some real world workload scenarios.

Turbostat was used for package power; it starts before the Phoronix test
starts and ends after the Phoronix test ends.

Control Sample: Kernel 4.5-rc3:
Phoronix ffmpeg: turbostat 180 Sec. 12.07 Sec. Ave. 27.14 Watts.
Phoronix apache: turbostat 200 Sec. 19797.0 R.P.S. Ave. 34.01 Watts.
Phoronix kernel: turbostat 180 Sec. 139.93 Sec. 49.09 Watts.
Phoronix Postmark (Disk Test): turbostat 200 Sec. 5813 T.P.S. Ave. 21.33 Watts.

Kernel 4.5-rc3 + RJW 3 patch set version 7:
Phoronix ffmpeg: turbostat 180 Sec. 11.67 Sec. Ave. 27.35 Watts.
Phoronix apache: turbostat 200 Sec. 19430.7 R.P.S. Ave. 34.18 Watts.
Phoronix kernel: turbostat 180 Sec. 139.81 Sec. 48.80 Watts.
Phoronix Postmark (Disk Test): turbostat 200 Sec. 5683 T.P.S. Ave. 22.41 Watts.

... Doug

^ permalink raw reply	[flat|nested] 134+ messages in thread

* [PATCH v9 1/3] cpufreq: Add mechanism for registering utilization update callbacks
  2016-02-11 17:30       ` [PATCH v8 " Rafael J. Wysocki
@ 2016-02-12 13:16         ` Rafael J. Wysocki
  2016-02-15 21:47           ` [PATCH v10 " Rafael J. Wysocki
  0 siblings, 1 reply; 134+ messages in thread
From: Rafael J. Wysocki @ 2016-02-12 13:16 UTC (permalink / raw)
  To: Linux PM list, Peter Zijlstra
  Cc: Ingo Molnar, Linux Kernel Mailing List, Srinivas Pandruvada,
	Viresh Kumar, Juri Lelli, Steve Muckle, Thomas Gleixner

From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Subject: [PATCH] cpufreq: Add mechanism for registering utilization update callbacks

Introduce a mechanism by which parts of the cpufreq subsystem
("setpolicy" drivers or the core) can register callbacks to be
executed from cpufreq_update_util() which is invoked by the
scheduler's update_load_avg() on CPU utilization changes.

This allows the "setpolicy" drivers to dispense with their timers
and do all of the computations they need and frequency/voltage
adjustments in the update_load_avg() code path, among other things.

The update_load_avg() changes were suggested by Peter Zijlstra.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
---

Peter,

If the enqueue hooks aren't tolerable and I should drop them, please let me
know.

Changes from v8:
- Peter thinks that cpufreq hooks in update_curr_rt/dl() are overkill so
  move them to task_tick_rt/dl() and enqueue_task_rt/dl() (in case RT/DL
  tasks are only active between ticks), update the cpufreq_trigger_update()
  kerneldoc.

Changes from v7
- cpufreq_trigger_update() has a kerneldoc describing it as a band-aid to
  be replaced in the future and the comments next to its call sites ask
  the reader to see that comment.

  No functional changes. 

Changes from v6:
- Steve suggested to use rq_clock() instead of rq_clock_task() as the time
  argument for cpufreq_update_util() as that seems to be more suitable for
  this purpose.

Thanks,
Rafael

---
 drivers/cpufreq/cpufreq.c |   45 +++++++++++++++++++++++++++++++++++++++++++++
 include/linux/cpufreq.h   |   37 +++++++++++++++++++++++++++++++++++++
 kernel/sched/deadline.c   |    6 ++++++
 kernel/sched/fair.c       |   26 +++++++++++++++++++++++++-
 kernel/sched/rt.c         |    6 ++++++
 kernel/sched/sched.h      |    1 +
 6 files changed, 120 insertions(+), 1 deletion(-)

Index: linux-pm/include/linux/cpufreq.h
===================================================================
--- linux-pm.orig/include/linux/cpufreq.h
+++ linux-pm/include/linux/cpufreq.h
@@ -151,6 +151,39 @@ static inline bool policy_is_shared(stru
 extern struct kobject *cpufreq_global_kobject;
 
 #ifdef CONFIG_CPU_FREQ
+void cpufreq_update_util(u64 time, unsigned long util, unsigned long max);
+
+/**
+ * cpufreq_trigger_update - Trigger CPU performance state evaluation if needed.
+ * @time: Current time.
+ *
+ * The way cpufreq is currently arranged requires it to evaluate the CPU
+ * performance state (frequency/voltage) on a regular basis to prevent it from
+ * being stuck in a completely inadequate performance level for too long.
+ * That is not guaranteed to happen if the updates are only triggered from CFS,
+ * though, because they may not be coming in if RT or deadline tasks are active
+ * all the time (or there are RT and DL tasks only).
+ *
+ * As a workaround for that issue, this function is called by the RT and DL
+ * sched classes to trigger extra cpufreq updates to prevent it from stalling,
+ * but that really is a band-aid.  Going forward it should be replaced with
+ * solutions targeted more specifically at RT and DL tasks.
+ *
+ * The extra updates are triggered from the tick and enqueue (in case RT/DL
+ * tasks are only active between ticks).
+ */
+static inline void cpufreq_trigger_update(u64 time)
+{
+	cpufreq_update_util(time, ULONG_MAX, 0);
+}
+
+struct update_util_data {
+	void (*func)(struct update_util_data *data,
+		     u64 time, unsigned long util, unsigned long max);
+};
+
+void cpufreq_set_update_util_data(int cpu, struct update_util_data *data);
+
 unsigned int cpufreq_get(unsigned int cpu);
 unsigned int cpufreq_quick_get(unsigned int cpu);
 unsigned int cpufreq_quick_get_max(unsigned int cpu);
@@ -162,6 +195,10 @@ int cpufreq_update_policy(unsigned int c
 bool have_governor_per_policy(void);
 struct kobject *get_governor_parent_kobj(struct cpufreq_policy *policy);
 #else
+static inline void cpufreq_update_util(u64 time, unsigned long util,
+				       unsigned long max) {}
+static inline void cpufreq_trigger_update(u64 time) {}
+
 static inline unsigned int cpufreq_get(unsigned int cpu)
 {
 	return 0;
Index: linux-pm/kernel/sched/sched.h
===================================================================
--- linux-pm.orig/kernel/sched/sched.h
+++ linux-pm/kernel/sched/sched.h
@@ -9,6 +9,7 @@
 #include <linux/irq_work.h>
 #include <linux/tick.h>
 #include <linux/slab.h>
+#include <linux/cpufreq.h>
 
 #include "cpupri.h"
 #include "cpudeadline.h"
Index: linux-pm/kernel/sched/fair.c
===================================================================
--- linux-pm.orig/kernel/sched/fair.c
+++ linux-pm/kernel/sched/fair.c
@@ -2824,7 +2824,8 @@ static inline void update_load_avg(struc
 {
 	struct cfs_rq *cfs_rq = cfs_rq_of(se);
 	u64 now = cfs_rq_clock_task(cfs_rq);
-	int cpu = cpu_of(rq_of(cfs_rq));
+	struct rq *rq = rq_of(cfs_rq);
+	int cpu = cpu_of(rq);
 
 	/*
 	 * Track task load average for carrying it to new CPU after migrated, and
@@ -2836,6 +2837,29 @@ static inline void update_load_avg(struc
 
 	if (update_cfs_rq_load_avg(now, cfs_rq) && update_tg)
 		update_tg_load_avg(cfs_rq, 0);
+
+	if (cpu == smp_processor_id() && &rq->cfs == cfs_rq) {
+		unsigned long max = rq->cpu_capacity_orig;
+
+		/*
+		 * There are a few boundary cases this might miss but it should
+		 * get called often enough that that should (hopefully) not be
+		 * a real problem -- added to that it only calls on the local
+		 * CPU, so if we enqueue remotely we'll miss an update, but
+		 * the next tick/schedule should update.
+		 *
+		 * It will not get called when we go idle, because the idle
+		 * thread is a different class (!fair), nor will the utilization
+		 * number include things like RT tasks.
+		 *
+		 * As is, the util number is not freq-invariant (we'd have to
+		 * implement arch_scale_freq_capacity() for that).
+		 *
+		 * See cpu_util().
+		 */
+		cpufreq_update_util(rq_clock(rq),
+				    min(cfs_rq->avg.util_avg, max), max);
+	}
 }
 
 static void attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
Index: linux-pm/kernel/sched/deadline.c
===================================================================
--- linux-pm.orig/kernel/sched/deadline.c
+++ linux-pm/kernel/sched/deadline.c
@@ -935,6 +935,9 @@ static void enqueue_task_dl(struct rq *r
 	struct task_struct *pi_task = rt_mutex_get_top_task(p);
 	struct sched_dl_entity *pi_se = &p->dl;
 
+	/* Kick cpufreq (see the comment in linux/cpufreq.h). */
+	cpufreq_trigger_update(rq_clock(rq));
+
 	/*
 	 * Use the scheduling parameters of the top pi-waiter
 	 * task if we have one and its (absolute) deadline is
@@ -1205,6 +1208,9 @@ static void task_tick_dl(struct rq *rq,
 	if (hrtick_enabled(rq) && queued && p->dl.runtime > 0 &&
 	    is_leftmost(p, &rq->dl))
 		start_hrtick_dl(rq, p);
+
+	/* Kick cpufreq (see the comment in linux/cpufreq.h). */
+	cpufreq_trigger_update(rq_clock(rq));
 }
 
 static void task_fork_dl(struct task_struct *p)
Index: linux-pm/kernel/sched/rt.c
===================================================================
--- linux-pm.orig/kernel/sched/rt.c
+++ linux-pm/kernel/sched/rt.c
@@ -1257,6 +1257,9 @@ enqueue_task_rt(struct rq *rq, struct ta
 {
 	struct sched_rt_entity *rt_se = &p->rt;
 
+	/* Kick cpufreq (see the comment in linux/cpufreq.h). */
+	cpufreq_trigger_update(rq_clock(rq));
+
 	if (flags & ENQUEUE_WAKEUP)
 		rt_se->timeout = 0;
 
@@ -2214,6 +2217,9 @@ static void task_tick_rt(struct rq *rq,
 
 	watchdog(rq, p);
 
+	/* Kick cpufreq (see the comment in linux/cpufreq.h). */
+	cpufreq_trigger_update(rq_clock(rq));
+
 	/*
 	 * RR tasks need a special form of timeslice management.
 	 * FIFO tasks have no timeslices.
Index: linux-pm/drivers/cpufreq/cpufreq.c
===================================================================
--- linux-pm.orig/drivers/cpufreq/cpufreq.c
+++ linux-pm/drivers/cpufreq/cpufreq.c
@@ -102,6 +102,51 @@ static LIST_HEAD(cpufreq_governor_list);
 static struct cpufreq_driver *cpufreq_driver;
 static DEFINE_PER_CPU(struct cpufreq_policy *, cpufreq_cpu_data);
 static DEFINE_RWLOCK(cpufreq_driver_lock);
+
+static DEFINE_PER_CPU(struct update_util_data *, cpufreq_update_util_data);
+
+/**
+ * cpufreq_set_update_util_data - Populate the CPU's update_util_data pointer.
+ * @cpu: The CPU to set the pointer for.
+ * @data: New pointer value.
+ *
+ * Set and publish the update_util_data pointer for the given CPU.  That pointer
+ * points to a struct update_util_data object containing a callback function
+ * to call from cpufreq_update_util().  That function will be called from an RCU
+ * read-side critical section, so it must not sleep.
+ *
+ * Callers must use RCU callbacks to free any memory that might be accessed
+ * via the old update_util_data pointer or invoke synchronize_rcu() right after
+ * this function to avoid use-after-free.
+ */
+void cpufreq_set_update_util_data(int cpu, struct update_util_data *data)
+{
+	rcu_assign_pointer(per_cpu(cpufreq_update_util_data, cpu), data);
+}
+EXPORT_SYMBOL_GPL(cpufreq_set_update_util_data);
+
+/**
+ * cpufreq_update_util - Take a note about CPU utilization changes.
+ * @time: Current time.
+ * @util: Current utilization.
+ * @max: Utilization ceiling.
+ *
+ * This function is called by the scheduler on every invocation of
+ * update_load_avg() on the CPU whose utilization is being updated.
+ */
+void cpufreq_update_util(u64 time, unsigned long util, unsigned long max)
+{
+	struct update_util_data *data;
+
+	rcu_read_lock();
+
+	data = rcu_dereference(*this_cpu_ptr(&cpufreq_update_util_data));
+	if (data && data->func)
+		data->func(data, time, util, max);
+
+	rcu_read_unlock();
+}
+
 DEFINE_MUTEX(cpufreq_governor_lock);
 
 /* Flag to suspend/resume CPUFreq governors */
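
For illustration only (this is not part of the patch): a client of this
interface, such as a "setpolicy" driver, would register its callback and
tear it down roughly as below, following the kerneldoc above; all names
here are made up:

static void my_update_util(struct update_util_data *data, u64 time,
                           unsigned long util, unsigned long max)
{
        /*
         * Called from scheduler context within an RCU read-side critical
         * section: must not sleep and should be cheap; raw spinlocks
         * only, if any locking is needed at all.
         */
}

static DEFINE_PER_CPU(struct update_util_data, my_util_data);

static void my_start(unsigned int cpu)
{
        per_cpu(my_util_data, cpu).func = my_update_util;
        cpufreq_set_update_util_data(cpu, &per_cpu(my_util_data, cpu));
}

static void my_stop(unsigned int cpu)
{
        cpufreq_set_update_util_data(cpu, NULL);
        synchronize_rcu();      /* let in-flight callbacks finish first */
}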

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH v6 0/3] cpufreq: Replace timers with utilization update callbacks
  2016-02-12  7:25         ` Doug Smythies
@ 2016-02-12 13:39           ` Rafael J. Wysocki
  2016-02-12 17:33             ` Doug Smythies
  0 siblings, 1 reply; 134+ messages in thread
From: Rafael J. Wysocki @ 2016-02-12 13:39 UTC (permalink / raw)
  To: Doug Smythies
  Cc: Rafael J. Wysocki, Srinivas Pandruvada, Linux PM list,
	Ingo Molnar, Linux Kernel Mailing List, Peter Zijlstra,
	Viresh Kumar, Juri Lelli, Steve Muckle, Thomas Gleixner

On Fri, Feb 12, 2016 at 8:25 AM, Doug Smythies <dsmythies@telus.net> wrote:
> On 2016.02.11 14:50 Doug Smythies wrote:
>> On 2016.02.10 22:03 Srinivas Pandruvada wrote:
>>> On Wednesday, February 10, 2016 03:11:43 PM Doug Smythies wrote:
>
>>>> My test computer has an older model i7 (Intel(R) Core(TM) i7-2600K CPU @ 3.40GHz)
>>> Thanks Doug. If you have specific workloads, please compare performance.
>
>> My work so far has been testing functionality, with unrealistic workloads specifically
>> designed to exaggerate issues, in this case the duration problem.
>>
>> I'll look at some real world workload scenarios.
>
> Turbostat was used for package power; it starts before the Phoronix test
> starts and ends after the Phoronix test ends.
>
> Control Sample: Kernel 4.5-rc3:
> Phoronix ffmpeg: turbostat 180 Sec. 12.07 Sec. Ave. 27.14 Watts.
> Phoronix apache: turbostat 200 Sec. 19797.0 R.P.S. Ave. 34.01 Watts.
> Phoronix kernel: turbostat 180 Sec. 139.93 Sec. 49.09 Watts.
> Phoronix Postmark (Disk Test): turbostat 200 Sec. 5813 T.P.S. Ave. 21.33 Watts.
>
> Kernel 4.5-rc3 + RJW 3 patch set version 7:
> Phoronix ffmpeg: turbostat 180 Sec. 11.67 Sec. Ave. 27.35 Watts.
> Phoronix apache: turbostat 200 Sec. 19430.7 R.P.S. Ave. 34.18 Watts.
> Phoronix kernel: turbostat 180 Sec. 139.81 Sec. 48.80 Watts.
> Phoronix Postmark (Disk Test): turbostat 200 Sec. 5683 T.P.S. Ave. 22.41 Watts.

Thanks for the results!

The Postmark result is somewhat below expectations (especially with
respect to the energy consumption), but we should be able to improve
that by using the util numbers intelligently.

Do you have full turbostat reports from those runs by any chance?  I'm
wondering what happens to the idle state residencies, for example.

Thanks,
Rafael

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH 0/3] cpufreq: Replace timers with utilization update callbacks
  2016-02-11 19:04                     ` Rafael J. Wysocki
@ 2016-02-12 13:43                       ` Rafael J. Wysocki
  0 siblings, 0 replies; 134+ messages in thread
From: Rafael J. Wysocki @ 2016-02-12 13:43 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Steve Muckle, Rafael J. Wysocki, Linux PM list,
	Linux Kernel Mailing List, Srinivas Pandruvada, Viresh Kumar,
	Juri Lelli, Thomas Gleixner

On Thu, Feb 11, 2016 at 8:04 PM, Rafael J. Wysocki <rafael@kernel.org> wrote:
> On Thu, Feb 11, 2016 at 7:52 PM, Steve Muckle <steve.muckle@linaro.org> wrote:
>> On 02/11/2016 09:30 AM, Peter Zijlstra wrote:
>>>> My concern above is that pokes are guaranteed to keep occurring when
>>>> > there is only RT or DL activity so nothing breaks.
>>>
>>> The hook in their respective tick handler should ensure stuff is called
>>> sporadically and isn't stalled.
>>
>> But that's only true if the RT/DL tasks happen to be running when the
>> tick arrives right?
>>
>> Couldn't we have RT/DL activity which doesn't overlap with the tick? And
>> if no CFS tasks happen to be executing on that CPU, we'll never trigger
>> the cpufreq update. This could go on for an arbitrarily long time
>> depending on the periodicity of the work.
>
> I'm thinking that two additional hooks in enqueue_task_rt/dl() might
> help here.  Then, we will hit either the tick or enqueue and that
> should do the trick.
>
> Peter, what do you think?

In any case I posted a v9 with those changes
(https://patchwork.kernel.org/patch/8290791/).

Again, it doesn't appear to break things.

If the enqueue hooks are bad (unwanted at all or in wrong places),
please let me know.

Thanks,
Rafael

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH 0/3] cpufreq: Replace timers with utilization update callbacks
  2016-02-11 18:23                   ` Vincent Guittot
@ 2016-02-12 14:04                     ` Peter Zijlstra
  2016-02-12 14:48                       ` Vincent Guittot
  0 siblings, 1 reply; 134+ messages in thread
From: Peter Zijlstra @ 2016-02-12 14:04 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Juri Lelli, Steve Muckle, Rafael J. Wysocki, Rafael J. Wysocki,
	Linux PM list, Linux Kernel Mailing List, Srinivas Pandruvada,
	Viresh Kumar, Thomas Gleixner

On Thu, Feb 11, 2016 at 07:23:55PM +0100, Vincent Guittot wrote:
> I agree that using rt_avg is not the best choice to evaluate the
> capacity that is used by RT tasks, but it has the advantage of being
> already there. Do you mean that we should use another way to compute
> the capacity that is used by RT tasks in order to then select the frequency?

Nope, RR/FIFO simply do not contain enough information to compute
anything from.

> Or do you mean that we can't do anything other than ask for max
> frequency?

Yep.

> Trying to set max frequency just before scheduling an RT task is not
> really doable on a lot of platforms, because the sequence that changes
> the frequency can sleep and takes more time than the run time of the
> task.

So what people do today is shoot cpufreq in the head and not use it;
maybe that's the 'right' thing on these platforms.

> In the end, we will have set max frequency once the task has
> finished running. There is no other solution than increasing the
> min_freq of cpufreq to a level that will ensure enough compute
> capacity for an RT task with such high constraints that cpufreq can't
> react.

But you cannot tell a priori how much time RR/FIFO tasks will require;
that's the entire problem with them. We can compute a hysterical
average, but that _will_ mispredict the future and get you
underruns/deadline misses.

> For other RT tasks, we can probably find a way to set a
> frequency that fits both RT constraints and power consumption.

You cannot, not without adding a lot more information about what these
tasks are doing, and that is not captured in the task model.

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH 0/3] cpufreq: Replace timers with utilization update callbacks
  2016-02-11 18:52                   ` Steve Muckle
  2016-02-11 19:04                     ` Rafael J. Wysocki
@ 2016-02-12 14:10                     ` Peter Zijlstra
  2016-02-12 16:01                       ` Rafael J. Wysocki
  1 sibling, 1 reply; 134+ messages in thread
From: Peter Zijlstra @ 2016-02-12 14:10 UTC (permalink / raw)
  To: Steve Muckle
  Cc: Rafael J. Wysocki, Rafael J. Wysocki, Linux PM list,
	Linux Kernel Mailing List, Srinivas Pandruvada, Viresh Kumar,
	Juri Lelli, Thomas Gleixner

On Thu, Feb 11, 2016 at 10:52:20AM -0800, Steve Muckle wrote:
> On 02/11/2016 09:30 AM, Peter Zijlstra wrote:
> >> My concern above is that pokes are guaranteed to keep occurring when
> >> > there is only RT or DL activity so nothing breaks.
> >
> > The hook in their respective tick handler should ensure stuff is called
> > sporadically and isn't stalled.
> 
> But that's only true if the RT/DL tasks happen to be running when the
> tick arrives right?
> 
> Couldn't we have RT/DL activity which doesn't overlap with the tick? And
> if no CFS tasks happen to be executing on that CPU, we'll never trigger
> the cpufreq update. This could go on for an arbitrarily long time
> depending on the periodicity of the work.

Possible yes, but why do we care? Such a CPU would be so much idle that
cpufreq doesn't matter one way or another, right?

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH 0/3] cpufreq: Replace timers with utilization update callbacks
  2016-02-12 14:04                     ` Peter Zijlstra
@ 2016-02-12 14:48                       ` Vincent Guittot
  2016-03-01 13:58                         ` Peter Zijlstra
  0 siblings, 1 reply; 134+ messages in thread
From: Vincent Guittot @ 2016-02-12 14:48 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Juri Lelli, Steve Muckle, Rafael J. Wysocki, Rafael J. Wysocki,
	Linux PM list, Linux Kernel Mailing List, Srinivas Pandruvada,
	Viresh Kumar, Thomas Gleixner

On 12 February 2016 at 15:04, Peter Zijlstra <peterz@infradead.org> wrote:
> On Thu, Feb 11, 2016 at 07:23:55PM +0100, Vincent Guittot wrote:
>> I agree that using rt_avg is not the best choice to evaluate the
>> capacity that is used by RT tasks, but it has the advantage of being
>> already there. Do you mean that we should use another way to compute
>> the capacity that is used by RT tasks in order to then select the frequency?
>
> Nope, RR/FIFO simply do not contain enough information to compute
> anything from.
>
>> Or do you mean that we can't do anything other than ask for max
>> frequency?
>
> Yep.
>
>> Trying to set max frequency just before scheduling an RT task is not
>> really doable on a lot of platforms, because the sequence that changes
>> the frequency can sleep and takes more time than the run time of the
>> task.
>
> So what people do today is shoot cpufreq in the head and not use it;
> maybe that's the 'right' thing on these platforms.
>
>> In the end, we will have set max frequency once the task has
>> finished running. There is no other solution than increasing the
>> min_freq of cpufreq to a level that will ensure enough compute
>> capacity for an RT task with such high constraints that cpufreq can't
>> react.
>
> But you cannot tell a priori how much time RR/FIFO tasks will require;
> that's the entire problem with them. We can compute a hysterical
> average, but that _will_ mispredict the future and get you
> underruns/deadline misses.
>
>> For other RT tasks, we can probably find a way to set a
>> frequency that fits both RT constraints and power consumption.
>
> You cannot, not without adding a lot more information about what these
> tasks are doing, and that is not captured in the task model.

Another point to take into account is that the RT tasks will "steal"
the compute capacity that has been requested by the CFS tasks.

Take the example of a CPU with 3 OPPs on which run 2 RT tasks A
and B and 1 CFS task C.
Assume that the real-time constraint of RT task A is too aggressive
for the lowest OPP0, and that changing the frequency of the core is
too slow compared to this constraint, but the real-time constraint of
RT task B can be handled at any OPP. The system has no choice other
than setting the cpufreq min freq to OPP1 to be sure that the
constraint of task A will be covered at any time. Then, we still have
2 possible OPPs. The CFS task asks for compute capacity that fits in
OPP1, but part of this capacity will be stolen by the RT tasks. If we
monitor the load of the RT tasks and request capacity for them
according to their current utilization, we can decide to switch to the
highest OPP2 to ensure that task C will have enough remaining
capacity. A lot of embedded platforms face this kind of use case.

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH 0/3] cpufreq: Replace timers with utilization update callbacks
  2016-02-12 14:10                     ` Peter Zijlstra
@ 2016-02-12 16:01                       ` Rafael J. Wysocki
  2016-02-12 16:15                         ` Rafael J. Wysocki
  2016-02-12 17:02                         ` Doug Smythies
  0 siblings, 2 replies; 134+ messages in thread
From: Rafael J. Wysocki @ 2016-02-12 16:01 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Steve Muckle, Rafael J. Wysocki, Rafael J. Wysocki, Linux PM list,
	Linux Kernel Mailing List, Srinivas Pandruvada, Viresh Kumar,
	Juri Lelli, Thomas Gleixner

On Fri, Feb 12, 2016 at 3:10 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Thu, Feb 11, 2016 at 10:52:20AM -0800, Steve Muckle wrote:
>> On 02/11/2016 09:30 AM, Peter Zijlstra wrote:
>> >> My concern above is that pokes are guaranteed to keep occurring when
>> >> > there is only RT or DL activity so nothing breaks.
>> >
>> > The hook in their respective tick handler should ensure stuff is called
>> > sporadically and isn't stalled.
>>
>> But that's only true if the RT/DL tasks happen to be running when the
>> tick arrives right?
>>
>> Couldn't we have RT/DL activity which doesn't overlap with the tick? And
>> if no CFS tasks happen to be executing on that CPU, we'll never trigger
>> the cpufreq update. This could go on for an arbitrarily long time
>> depending on the periodicity of the work.
>
> Possible yes, but why do we care? Such a CPU would be so much idle that
> cpufreq doesn't matter one way or another, right?

Well, in theory you can get 50% or so of the time active in bursts
that happen to fit between ticks.  If we happen to do those in the
lowest P-state, we may burn more energy than necessary on platforms
where more idle is preferred.

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH 0/3] cpufreq: Replace timers with utilization update callbacks
  2016-02-12 16:01                       ` Rafael J. Wysocki
@ 2016-02-12 16:15                         ` Rafael J. Wysocki
  2016-02-12 16:53                           ` Ashwin Chaugule
  2016-02-12 17:02                         ` Doug Smythies
  1 sibling, 1 reply; 134+ messages in thread
From: Rafael J. Wysocki @ 2016-02-12 16:15 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Steve Muckle, Rafael J. Wysocki, Linux PM list,
	Linux Kernel Mailing List, Srinivas Pandruvada, Viresh Kumar,
	Juri Lelli, Thomas Gleixner

On Fri, Feb 12, 2016 at 5:01 PM, Rafael J. Wysocki <rafael@kernel.org> wrote:
> On Fri, Feb 12, 2016 at 3:10 PM, Peter Zijlstra <peterz@infradead.org> wrote:
>> On Thu, Feb 11, 2016 at 10:52:20AM -0800, Steve Muckle wrote:
>>> On 02/11/2016 09:30 AM, Peter Zijlstra wrote:
>>> >> My concern above is that pokes are guaranteed to keep occurring when
>>> >> > there is only RT or DL activity so nothing breaks.
>>> >
>>> > The hook in their respective tick handler should ensure stuff is called
>>> > sporadically and isn't stalled.
>>>
>>> But that's only true if the RT/DL tasks happen to be running when the
>>> tick arrives right?
>>>
>>> Couldn't we have RT/DL activity which doesn't overlap with the tick? And
>>> if no CFS tasks happen to be executing on that CPU, we'll never trigger
>>> the cpufreq update. This could go on for an arbitrarily long time
>>> depending on the periodicity of the work.
>>
>> Possible yes, but why do we care? Such a CPU would be so much idle that
>> cpufreq doesn't matter one way or another, right?
>
> Well, in theory you can get 50% or so of the time active in bursts
> that happen to fit between ticks.  If we happen to do those in the
> lowest P-state, we may burn more energy than necessary on platforms
> where more idle is preferred.

At least intel_pstate should be able to figure out which P-state to
use then on the APERF/MPERF basis.
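
For reference, both MSRs count only while the core is in C0 (APERF at
the actual clock, MPERF at the TSC rate), so deltas taken over a
sampling interval expose activity that falls entirely between ticks.
Roughly, as a sketch of the arithmetic rather than actual driver code:

        busy_pct = 100 * delta_mperf / delta_tsc;       /* fraction of time in C0 */
        perf_pct = 100 * delta_aperf / delta_mperf;     /* average speed while in C0 */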

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH 0/3] cpufreq: Replace timers with utilization update callbacks
  2016-02-12 16:15                         ` Rafael J. Wysocki
@ 2016-02-12 16:53                           ` Ashwin Chaugule
  2016-02-12 23:14                             ` Rafael J. Wysocki
  0 siblings, 1 reply; 134+ messages in thread
From: Ashwin Chaugule @ 2016-02-12 16:53 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Peter Zijlstra, Steve Muckle, Rafael J. Wysocki, Linux PM list,
	Linux Kernel Mailing List, Srinivas Pandruvada, Viresh Kumar,
	Juri Lelli, Thomas Gleixner

On 12 February 2016 at 11:15, Rafael J. Wysocki <rafael@kernel.org> wrote:
> On Fri, Feb 12, 2016 at 5:01 PM, Rafael J. Wysocki <rafael@kernel.org> wrote:
>> On Fri, Feb 12, 2016 at 3:10 PM, Peter Zijlstra <peterz@infradead.org> wrote:
>>> On Thu, Feb 11, 2016 at 10:52:20AM -0800, Steve Muckle wrote:
>>>> On 02/11/2016 09:30 AM, Peter Zijlstra wrote:
>>>> >> My concern above is that pokes are guaranteed to keep occurring when
>>>> >> > there is only RT or DL activity so nothing breaks.
>>>> >
>>>> > The hook in their respective tick handler should ensure stuff is called
>>>> > sporadically and isn't stalled.
>>>>
>>>> But that's only true if the RT/DL tasks happen to be running when the
>>>> tick arrives right?
>>>>
>>>> Couldn't we have RT/DL activity which doesn't overlap with the tick? And
>>>> if no CFS tasks happen to be executing on that CPU, we'll never trigger
>>>> the cpufreq update. This could go on for an arbitrarily long time
>>>> depending on the periodicity of the work.
>>>
>>> Possible yes, but why do we care? Such a CPU would be so much idle that
>>> cpufreq doesn't matter one way or another, right?
>>
>> Well, in theory you can get 50% or so of the time active in bursts
>> that happen to fit between ticks.  If we happen to do those in the
>> lowest P-state, we may burn more energy than necessary on platforms
>> where more idle is preferred.
>
> At least intel_pstate should be able to figure out which P-state to
> use then on the APERF/MPERF basis.

Speaking for the generic case, it would be great to make use of such
feedback counters for selecting the next frequency request: use (number
of cycles used / total cycles) to figure out the %ON time for the CPU.
I understand it's not the goal of this patch series, but if in the
future we can do this in your callbacks where possible, then I think we
will do better than ondemand.

Regards,
Ashwin.

^ permalink raw reply	[flat|nested] 134+ messages in thread

* RE: [PATCH 0/3] cpufreq: Replace timers with utilization update callbacks
  2016-02-12 16:01                       ` Rafael J. Wysocki
  2016-02-12 16:15                         ` Rafael J. Wysocki
@ 2016-02-12 17:02                         ` Doug Smythies
  2016-02-12 23:17                           ` Rafael J. Wysocki
  1 sibling, 1 reply; 134+ messages in thread
From: Doug Smythies @ 2016-02-12 17:02 UTC (permalink / raw)
  To: 'Rafael J. Wysocki', 'Peter Zijlstra'
  Cc: 'Steve Muckle', 'Rafael J. Wysocki',
	'Linux PM list', 'Linux Kernel Mailing List',
	'Srinivas Pandruvada', 'Viresh Kumar',
	'Juri Lelli', 'Thomas Gleixner'

On 2016.02.12 08:01 Rafael J. Wysocki wrote:
> On Fri, Feb 12, 2016 at 3:10 PM, Peter Zijlstra <peterz@infradead.org> wrote:
>> On Thu, Feb 11, 2016 at 10:52:20AM -0800, Steve Muckle wrote:
>>> On 02/11/2016 09:30 AM, Peter Zijlstra wrote:
>>>>> My concern above is that pokes are guaranteed to keep occurring when
>>>>> there is only RT or DL activity so nothing breaks.
>>>>
>>>> The hook in their respective tick handler should ensure stuff is called
>>>> sporadically and isn't stalled.
>>>
>>> But that's only true if the RT/DL tasks happen to be running when the
>>> tick arrives right?
>>>
>>> Couldn't we have RT/DL activity which doesn't overlap with the tick? And
>>> if no CFS tasks happen to be executing on that CPU, we'll never trigger
>>> the cpufreq update. This could go on for an arbitrarily long time
>>> depending on the periodicity of the work.
>>
>> Possible yes, but why do we care? Such a CPU would be so much idle that
>> cpufreq doesn't matter one way or another, right?

> Well, in theory you can get 50% or so of the time active in bursts
> that happen to fit between ticks.  If we happen to do those in the
> lowest P-state, we may burn more energy than necessary on platforms
> where more idle is preferred.

I believe this happens considerably more often than is commonly thought,
and it is the exact reason I was opposed to the introduction of the
"duration" method into the intel_pstate driver in the first
place. The probability of occurrence (of a relatively busy CPU being idle
on jiffy boundaries) is very use-case dependent, occurring more on desktops
than servers, and sometimes more with video-frame-rate-based tasks. The data
supporting my claim is a couple of years old and not very complete, but I see
the issue often in trace data acquired from desktop users in bugzilla reports.
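
For readers following along, the "duration" logic in question scales
the computed busy value down when a long gap has passed since the
previous sample, roughly like this (a fragment reconstructed from the
driver of that era, not quoted verbatim):

        duration_us = ktime_us_delta(sample_time, last_sample_time);
        if (duration_us > 3 * sample_interval_us)
                /*
                 * Assume the CPU idled through most of the gap and scale
                 * the busy estimate down proportionally.
                 */
                core_busy = core_busy * sample_interval_us / duration_us;

A CPU that is busy between jiffies but idle whenever the deferrable
timer can fire accumulates a long duration, and so gets its busy
estimate, and hence its target P-state, forced down.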

Disclaimer: I fully admit that my related tests on the other thread have
been rigged to exaggerate the issue.

^ permalink raw reply	[flat|nested] 134+ messages in thread

* RE: [PATCH v6 0/3] cpufreq: Replace timers with utilization update callbacks
  2016-02-12 13:39           ` Rafael J. Wysocki
@ 2016-02-12 17:33             ` Doug Smythies
  2016-02-12 23:21               ` Rafael J. Wysocki
  0 siblings, 1 reply; 134+ messages in thread
From: Doug Smythies @ 2016-02-12 17:33 UTC (permalink / raw)
  To: 'Rafael J. Wysocki'
  Cc: 'Rafael J. Wysocki', 'Srinivas Pandruvada',
	'Linux PM list', 'Ingo Molnar',
	'Linux Kernel Mailing List', 'Peter Zijlstra',
	'Viresh Kumar', 'Juri Lelli',
	'Steve Muckle', 'Thomas Gleixner'

On 2016.02.12 05:39 Rafael J. Wysocki wrote:
> On Fri, Feb 12, 2016 at 8:25 AM, Doug Smythies <dsmythies@telus.net> wrote:
>> On 2016.02.11 14:50 Doug Smythies wrote:
>>> On 2016.02.10 22:03 Srinivas Pandruvada wrote:
>>>> On Wednesday, February 10, 2016 03:11:43 PM Doug Smythies wrote:
>>
>>>>> My test computer has an older model i7 (Intel(R) Core(TM) i7-2600K CPU @ 3.40GHz)
>>> Thanks Doug. If you have specific workloads, please compare performance.
>>
>>> My work so far has been testing functionality, with unrealistic workloads specifically
>>> designed to exaggerate issues, in this case the duration problem.
>>>
>>> I'll look at some real world workload scenarios.
>>
>> Turbostat was used for package power; it starts before the Phoronix test
>> starts and ends after the Phoronix test ends.
>>
>> Control Sample: Kernel 4.5-rc3:
>> Phoronix ffmpeg: turbostat 180 Sec. 12.07 Sec. Ave. 27.14 Watts.
>> Phoronix apache: turbostat 200 Sec. 19797.0 R.P.S. Ave. 34.01 Watts.
>> Phoronix kernel: turbostat 180 Sec. 139.93 Sec. 49.09 Watts.
>> Phoronix Postmark (Disk Test): turbostat 200 Sec. 5813 T.P.S. Ave. 21.33 Watts.
>>
>> Kernel 4.5-rc3 + RJW 3 patch set version 7:
>> Phoronix ffmpeg: turbostat 180 Sec. 11.67 Sec. Ave. 27.35 Watts.
>> Phoronix apache: turbostat 200 Sec. 19430.7 R.P.S. Ave. 34.18 Watts.
>> Phoronix kernel: turbostat 180 Sec. 139.81 Sec. 48.80 Watts.
>> Phoronix Postmark (Disk Test): turbostat 200 Sec. 5683 T.P.S. Ave. 22.41 Watts.

> Thanks for the results!
>
> The Postmark result is somewhat below expectations (especially with
> respect to the energy consumption), but we should be able to improve
> that by using the util numbers intelligently.
>
> Do you have full turbostat reports from those runs by any chance?  I'm
> wondering what happens to the idle state residencies, for example.

I did not keep the turbostat output; however, it is easy enough to
re-do the tests. I'll send you the stuff off-list, and copy
Srinivas.

By the way, there is an anomaly in my 2 hour idle data (v7), where
CPU 7 should have had sample passes through the intel_pstate driver.
It did not, hitting the 4 second time limit instead.
10 occurrences in 7200 seconds. I sent you an off-list HTML-format
e-mail with more details. There may be other anomalies I haven't
found yet.

... Doug

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH 0/3] cpufreq: Replace timers with utilization update callbacks
  2016-02-12 16:53                           ` Ashwin Chaugule
@ 2016-02-12 23:14                             ` Rafael J. Wysocki
  0 siblings, 0 replies; 134+ messages in thread
From: Rafael J. Wysocki @ 2016-02-12 23:14 UTC (permalink / raw)
  To: Ashwin Chaugule
  Cc: Rafael J. Wysocki, Peter Zijlstra, Steve Muckle,
	Rafael J. Wysocki, Linux PM list, Linux Kernel Mailing List,
	Srinivas Pandruvada, Viresh Kumar, Juri Lelli, Thomas Gleixner

On Fri, Feb 12, 2016 at 5:53 PM, Ashwin Chaugule
<ashwin.chaugule@linaro.org> wrote:
> On 12 February 2016 at 11:15, Rafael J. Wysocki <rafael@kernel.org> wrote:
>> On Fri, Feb 12, 2016 at 5:01 PM, Rafael J. Wysocki <rafael@kernel.org> wrote:
>>> On Fri, Feb 12, 2016 at 3:10 PM, Peter Zijlstra <peterz@infradead.org> wrote:
>>>> On Thu, Feb 11, 2016 at 10:52:20AM -0800, Steve Muckle wrote:
>>>>> On 02/11/2016 09:30 AM, Peter Zijlstra wrote:
>>>>> >> My concern above is that pokes are guaranteed to keep occurring when
>>>>> >> > there is only RT or DL activity so nothing breaks.
>>>>> >
>>>>> > The hook in their respective tick handler should ensure stuff is called
>>>>> > sporadically and isn't stalled.
>>>>>
>>>>> But that's only true if the RT/DL tasks happen to be running when the
>>>>> tick arrives right?
>>>>>
>>>>> Couldn't we have RT/DL activity which doesn't overlap with the tick? And
>>>>> if no CFS tasks happen to be executing on that CPU, we'll never trigger
>>>>> the cpufreq update. This could go on for an arbitrarily long time
>>>>> depending on the periodicity of the work.
>>>>
>>>> Possible yes, but why do we care? Such a CPU would be so much idle that
>>>> cpufreq doesn't matter one way or another, right?
>>>
>>> Well, in theory you can get 50% or so of the time active in bursts
>>> that happen to fit between ticks.  If we happen to do those in the
>>> lowest P-state, we may burn more energy than necessary on platforms
>>> where more idle is preferred.
>>
>> At least intel_pstate should be able to figure out which P-state to
>> use then on the APERF/MPERF basis.
>
> Speaking for the generic case, it would be great to make use of such
> feedback counters for selecting the next frequency request: use (number
> of cycles used / total cycles) to figure out the %ON time for the CPU.
> I understand it's not the goal of this patch series, but if in the
> future we can do this in your callbacks where possible, then I think we
> will do better than ondemand.

Yes, we can do that at least in principle.  intel_pstate is a proof of that.

Thanks,
Rafael

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH 0/3] cpufreq: Replace timers with utilization update callbacks
  2016-02-12 17:02                         ` Doug Smythies
@ 2016-02-12 23:17                           ` Rafael J. Wysocki
  0 siblings, 0 replies; 134+ messages in thread
From: Rafael J. Wysocki @ 2016-02-12 23:17 UTC (permalink / raw)
  To: Doug Smythies
  Cc: Rafael J. Wysocki, Peter Zijlstra, Steve Muckle,
	Rafael J. Wysocki, Linux PM list, Linux Kernel Mailing List,
	Srinivas Pandruvada, Viresh Kumar, Juri Lelli, Thomas Gleixner

On Fri, Feb 12, 2016 at 6:02 PM, Doug Smythies <dsmythies@telus.net> wrote:
> On 2016.02.12 08:01 Rafael J. Wysocki wrote:
>> On Fri, Feb 12, 2016 at 3:10 PM, Peter Zijlstra <peterz@infradead.org> wrote:
>>> On Thu, Feb 11, 2016 at 10:52:20AM -0800, Steve Muckle wrote:
>>>> On 02/11/2016 09:30 AM, Peter Zijlstra wrote:
>>>>>> My concern above is that pokes are guaranteed to keep occurring when
>>>>>> there is only RT or DL activity so nothing breaks.
>>>>>
>>>>> The hook in their respective tick handler should ensure stuff is called
>>>>> sporadically and isn't stalled.
>>>>
>>>> But that's only true if the RT/DL tasks happen to be running when the
>>>> tick arrives right?
>>>>
>>>> Couldn't we have RT/DL activity which doesn't overlap with the tick? And
>>>> if no CFS tasks happen to be executing on that CPU, we'll never trigger
>>>> the cpufreq update. This could go on for an arbitrarily long time
>>>> depending on the periodicity of the work.
>>>
>>> Possible yes, but why do we care? Such a CPU would be so much idle that
>>> cpufreq doesn't matter one way or another, right?
>
>> Well, in theory you can get 50% or so of the time active in bursts
>> that happen to fit between ticks.  If we happen to do those in the
>> lowest P-state, we may burn more energy than necessary on platforms
>> where more idle is preferred.
>
> I believe this happens considerably more often than is commonly thought,
> and it is the exact reason I was opposed to the introduction of the
> "duration" method into the intel_pstate driver in the first
> place. The probability of occurrence (of a relatively busy CPU being idle
> on jiffy boundaries) is very use-case dependent, occurring more on desktops
> than servers, and sometimes more with video-frame-rate-based tasks. The data
> supporting my claim is a couple of years old and not very complete, but I see
> the issue often in trace data acquired from desktop users in bugzilla reports.

The approach with update callbacks from the scheduler should not be
affected by this, because it takes updates not only at the tick time,
but also on other scheduler events.

Thanks,
Rafael

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH v6 0/3] cpufreq: Replace timers with utilization update callbacks
  2016-02-12 17:33             ` Doug Smythies
@ 2016-02-12 23:21               ` Rafael J. Wysocki
  0 siblings, 0 replies; 134+ messages in thread
From: Rafael J. Wysocki @ 2016-02-12 23:21 UTC (permalink / raw)
  To: Doug Smythies
  Cc: Rafael J. Wysocki, Rafael J. Wysocki, Srinivas Pandruvada,
	Linux PM list, Ingo Molnar, Linux Kernel Mailing List,
	Peter Zijlstra, Viresh Kumar, Juri Lelli, Steve Muckle,
	Thomas Gleixner

On Fri, Feb 12, 2016 at 6:33 PM, Doug Smythies <dsmythies@telus.net> wrote:
> On 2016.02.12 05:39 Rafael J. Wysocki wrote:
>> On Fri, Feb 12, 2016 at 8:25 AM, Doug Smythies <dsmythies@telus.net> wrote:
>>> On 2016.02.11 14:50 Doug Smythies wrote:
>>>> On 2016.02.10 22:03 Srinivas Pandruvada wrote:
>>>>> On Wednesday, February 10, 2016 03:11:43 PM Doug Smythies wrote:
>>>
>>>>>> My test computer has an older model i7 (Intel(R) Core(TM) i7-2600K CPU @ 3.40GHz)
>>>> Thanks Doug. If you have specific workloads, please compare performance.
>>>
>>>> My work so far has been testing functionality, with unrealistic workloads specifically
>>>> designed to exaggerate issues, in this case the duration problem.
>>>>
>>>> I'll look at some real world workload scenarios.
>>>
>>> Turbostat was used for package power; it starts before the Phoronix test
>>> starts and ends after the Phoronix test ends.
>>>
>>> Control Sample: Kernel 4.5-rc3:
>>> Phoronix ffmpeg: turbostat 180 Sec. 12.07 Sec. Ave. 27.14 Watts.
>>> Phoronix apache: turbostat 200 Sec. 19797.0 R.P.S. Ave. 34.01 Watts.
>>> Phoronix kernel: turbostat 180 Sec. 139.93 Sec. 49.09 Watts.
>>> Phoronix Postmark (Disk Test): turbostat 200 Sec. 5813 T.P.S. Ave. 21.33 Watts.
>>>
>>> Kernel 4.5-rc3 + RJW 3 patch set version 7:
>>> Phoronix ffmpeg: turbostat 180 Sec. 11.67 Sec. Ave. 27.35 Watts.
>>> Phoronix apache: turbostat 200 Sec. 19430.7 R.P.S. Ave. 34.18 Watts.
>>> Phoronix kernel: turbostat 180 Sec. 139.81 Sec. 48.80 Watts.
>>> Phoronix Postmark (Disk Test): turbostat 200 Sec. 5683 T.P.S. Ave. 22.41 Watts.
>
>> Thanks for the results!
>>
>> The Postmark result is somewhat below expectations (especially with
>> respect to the energy consumption), but we should be able to improve
>> that by using the util numbers intelligently.
>>
>> Do you have full turbostat reports from those runs by any chance?  I'm
>> wondering what happens to the idle state residencies, for example.
>
> I did not keep the turbostat output; however, it is easy enough to
> re-do the tests. I'll send you the stuff off-list, and copy
> Srinivas.

Thanks!

> By the way, there is an anomaly in my 2 hour idle data (v7), where
> CPU 7 should have had sample passes through the intel_pstate driver.
> It did not, hitting the 4 second time limit instead.

That most likely means that we had not scheduled anything on that CPU
for that time.  Not entirely unlikely if the system was generally
mostly idle.

The CPU activity you observed might be related to interrupts, in which
case we wouldn't receive updates from the scheduler.

> 10 occurrences in 7200 seconds. I sent you an off-list HTML-format
> e-mail with more details. There may be other anomalies I haven't
> found yet.

Well, I guess we'll see.

Thanks,
Rafael

^ permalink raw reply	[flat|nested] 134+ messages in thread

* [PATCH v10 1/3] cpufreq: Add mechanism for registering utilization update callbacks
  2016-02-12 13:16         ` [PATCH v9 " Rafael J. Wysocki
@ 2016-02-15 21:47           ` Rafael J. Wysocki
  2016-02-18 20:22             ` Rafael J. Wysocki
  2016-03-09 12:35             ` Peter Zijlstra
  0 siblings, 2 replies; 134+ messages in thread
From: Rafael J. Wysocki @ 2016-02-15 21:47 UTC (permalink / raw)
  To: Linux PM list, Peter Zijlstra
  Cc: Ingo Molnar, Linux Kernel Mailing List, Srinivas Pandruvada,
	Viresh Kumar, Juri Lelli, Steve Muckle, Thomas Gleixner

From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

Introduce a mechanism by which parts of the cpufreq subsystem
("setpolicy" drivers or the core) can register callbacks to be
executed from cpufreq_update_util() which is invoked by the
scheduler's update_load_avg() on CPU utilization changes.

This allows the "setpolicy" drivers to dispense with their timers
and do all of the computations they need and frequency/voltage
adjustments in the update_load_avg() code path, among other things.

The update_load_avg() changes were suggested by Peter Zijlstra.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
---

Changes from v9:
- Move the additional RT/DL hooks back to update_curr_rt/dl() (Peter says
  that's OK), but only call them if updating the current CPU's rq, update
  the cpufreq_trigger_update() kerneldoc.

Changes from v8:
- Peter thinks that cpufreq hooks in update_curr_rt/dl() are overkill so
  move them to task_tick_rt/dl() and enqueue_task_rt/dl() (in case RT/DL
  tasks are only active between ticks), update the cpufreq_trigger_update()
  kerneldoc.

Changes from v7
- cpufreq_trigger_update() has a kerneldoc describing it as a band-aid to
  be replaced in the future and the comments next to its call sites ask
  the reader to see that comment.

  No functional changes. 

Changes from v6:
- Steve suggested to use rq_clock() instead of rq_clock_task() as the time
  argument for cpufreq_update_util() as that seems to be more suitable for
  this purpose.

---
 drivers/cpufreq/cpufreq.c |   45 +++++++++++++++++++++++++++++++++++++++++++++
 include/linux/cpufreq.h   |   34 ++++++++++++++++++++++++++++++++++
 kernel/sched/deadline.c   |    4 ++++
 kernel/sched/fair.c       |   26 +++++++++++++++++++++++++-
 kernel/sched/rt.c         |    4 ++++
 kernel/sched/sched.h      |    1 +
 6 files changed, 113 insertions(+), 1 deletion(-)

Index: linux-pm/include/linux/cpufreq.h
===================================================================
--- linux-pm.orig/include/linux/cpufreq.h
+++ linux-pm/include/linux/cpufreq.h
@@ -151,6 +151,36 @@ static inline bool policy_is_shared(stru
 extern struct kobject *cpufreq_global_kobject;
 
 #ifdef CONFIG_CPU_FREQ
+void cpufreq_update_util(u64 time, unsigned long util, unsigned long max);
+
+/**
+ * cpufreq_trigger_update - Trigger CPU performance state evaluation if needed.
+ * @time: Current time.
+ *
+ * The way cpufreq is currently arranged requires it to evaluate the CPU
+ * performance state (frequency/voltage) on a regular basis to prevent it from
+ * being stuck in a completely inadequate performance level for too long.
+ * That is not guaranteed to happen if the updates are only triggered from CFS,
+ * though, because they may not be coming in if RT or deadline tasks are active
+ * all the time (or there are RT and DL tasks only).
+ *
+ * As a workaround for that issue, this function is called by the RT and DL
+ * sched classes to trigger extra cpufreq updates to prevent it from stalling,
+ * but that really is a band-aid.  Going forward it should be replaced with
+ * solutions targeted more specifically at RT and DL tasks.
+ */
+static inline void cpufreq_trigger_update(u64 time)
+{
+	cpufreq_update_util(time, ULONG_MAX, 0);
+}
+
+struct update_util_data {
+	void (*func)(struct update_util_data *data,
+		     u64 time, unsigned long util, unsigned long max);
+};
+
+void cpufreq_set_update_util_data(int cpu, struct update_util_data *data);
+
 unsigned int cpufreq_get(unsigned int cpu);
 unsigned int cpufreq_quick_get(unsigned int cpu);
 unsigned int cpufreq_quick_get_max(unsigned int cpu);
@@ -162,6 +192,10 @@ int cpufreq_update_policy(unsigned int c
 bool have_governor_per_policy(void);
 struct kobject *get_governor_parent_kobj(struct cpufreq_policy *policy);
 #else
+static inline void cpufreq_update_util(u64 time, unsigned long util,
+				       unsigned long max) {}
+static inline void cpufreq_trigger_update(u64 time) {}
+
 static inline unsigned int cpufreq_get(unsigned int cpu)
 {
 	return 0;
Index: linux-pm/kernel/sched/sched.h
===================================================================
--- linux-pm.orig/kernel/sched/sched.h
+++ linux-pm/kernel/sched/sched.h
@@ -9,6 +9,7 @@
 #include <linux/irq_work.h>
 #include <linux/tick.h>
 #include <linux/slab.h>
+#include <linux/cpufreq.h>
 
 #include "cpupri.h"
 #include "cpudeadline.h"
Index: linux-pm/kernel/sched/fair.c
===================================================================
--- linux-pm.orig/kernel/sched/fair.c
+++ linux-pm/kernel/sched/fair.c
@@ -2824,7 +2824,8 @@ static inline void update_load_avg(struc
 {
 	struct cfs_rq *cfs_rq = cfs_rq_of(se);
 	u64 now = cfs_rq_clock_task(cfs_rq);
-	int cpu = cpu_of(rq_of(cfs_rq));
+	struct rq *rq = rq_of(cfs_rq);
+	int cpu = cpu_of(rq);
 
 	/*
 	 * Track task load average for carrying it to new CPU after migrated, and
@@ -2836,6 +2837,29 @@ static inline void update_load_avg(struc
 
 	if (update_cfs_rq_load_avg(now, cfs_rq) && update_tg)
 		update_tg_load_avg(cfs_rq, 0);
+
+	if (cpu == smp_processor_id() && &rq->cfs == cfs_rq) {
+		unsigned long max = rq->cpu_capacity_orig;
+
+		/*
+		 * There are a few boundary cases this might miss but it should
+		 * get called often enough that that should (hopefully) not be
+		 * a real problem -- added to that it only calls on the local
+		 * CPU, so if we enqueue remotely we'll miss an update, but
+		 * the next tick/schedule should update.
+		 *
+		 * It will not get called when we go idle, because the idle
+		 * thread is a different class (!fair), nor will the utilization
+		 * number include things like RT tasks.
+		 *
+		 * As is, the util number is not freq-invariant (we'd have to
+		 * implement arch_scale_freq_capacity() for that).
+		 *
+		 * See cpu_util().
+		 */
+		cpufreq_update_util(rq_clock(rq),
+				    min(cfs_rq->avg.util_avg, max), max);
+	}
 }
 
 static void attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
Index: linux-pm/kernel/sched/deadline.c
===================================================================
--- linux-pm.orig/kernel/sched/deadline.c
+++ linux-pm/kernel/sched/deadline.c
@@ -726,6 +726,10 @@ static void update_curr_dl(struct rq *rq
 	if (!dl_task(curr) || !on_dl_rq(dl_se))
 		return;
 
+	/* Kick cpufreq (see the comment in linux/cpufreq.h). */
+	if (cpu_of(rq) == smp_processor_id())
+		cpufreq_trigger_update(rq_clock(rq));
+
 	/*
 	 * Consumed budget is computed considering the time as
 	 * observed by schedulable tasks (excluding time spent
Index: linux-pm/kernel/sched/rt.c
===================================================================
--- linux-pm.orig/kernel/sched/rt.c
+++ linux-pm/kernel/sched/rt.c
@@ -945,6 +945,10 @@ static void update_curr_rt(struct rq *rq
 	if (curr->sched_class != &rt_sched_class)
 		return;
 
+	/* Kick cpufreq (see the comment in linux/cpufreq.h). */
+	if (cpu_of(rq) == smp_processor_id())
+		cpufreq_trigger_update(rq_clock(rq));
+
 	delta_exec = rq_clock_task(rq) - curr->se.exec_start;
 	if (unlikely((s64)delta_exec <= 0))
 		return;
Index: linux-pm/drivers/cpufreq/cpufreq.c
===================================================================
--- linux-pm.orig/drivers/cpufreq/cpufreq.c
+++ linux-pm/drivers/cpufreq/cpufreq.c
@@ -102,6 +102,51 @@ static LIST_HEAD(cpufreq_governor_list);
 static struct cpufreq_driver *cpufreq_driver;
 static DEFINE_PER_CPU(struct cpufreq_policy *, cpufreq_cpu_data);
 static DEFINE_RWLOCK(cpufreq_driver_lock);
+
+static DEFINE_PER_CPU(struct update_util_data *, cpufreq_update_util_data);
+
+/**
+ * cpufreq_set_update_util_data - Populate the CPU's update_util_data pointer.
+ * @cpu: The CPU to set the pointer for.
+ * @data: New pointer value.
+ *
+ * Set and publish the update_util_data pointer for the given CPU.  That pointer
+ * points to a struct update_util_data object containing a callback function
+ * to call from cpufreq_update_util().  That function will be called from an RCU
+ * read-side critical section, so it must not sleep.
+ *
+ * Callers must use RCU callbacks to free any memory that might be accessed
+ * via the old update_util_data pointer or invoke synchronize_rcu() right after
+ * this function to avoid use-after-free.
+ */
+void cpufreq_set_update_util_data(int cpu, struct update_util_data *data)
+{
+	rcu_assign_pointer(per_cpu(cpufreq_update_util_data, cpu), data);
+}
+EXPORT_SYMBOL_GPL(cpufreq_set_update_util_data);
+
+/**
+ * cpufreq_update_util - Take a note about CPU utilization changes.
+ * @time: Current time.
+ * @util: Current utilization.
+ * @max: Utilization ceiling.
+ *
+ * This function is called by the scheduler on every invocation of
+ * update_load_avg() on the CPU whose utilization is being updated.
+ */
+void cpufreq_update_util(u64 time, unsigned long util, unsigned long max)
+{
+	struct update_util_data *data;
+
+	rcu_read_lock();
+
+	data = rcu_dereference(*this_cpu_ptr(&cpufreq_update_util_data));
+	if (data && data->func)
+		data->func(data, time, util, max);
+
+	rcu_read_unlock();
+}
+
 DEFINE_MUTEX(cpufreq_governor_lock);
 
 /* Flag to suspend/resume CPUFreq governors */
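
To make the new interface concrete, here is a minimal driver-side sketch
of the usage it implies.  The driver name and private data layout below
are hypothetical; only cpufreq_set_update_util_data(), the
update_util_data callback signature and the RCU teardown rule come from
the patch itself:

/*
 * Hypothetical "setpolicy" driver plugging into the new interface.
 * The callback runs in scheduler context: raw spinlocks only, no
 * sleeping, no direct wake_up_process().
 */
#include <linux/cpufreq.h>
#include <linux/percpu.h>
#include <linux/rcupdate.h>
#include <linux/slab.h>

struct my_cpu_data {
	struct update_util_data update_util;
	/* driver-private sampling state would live here */
};

static DEFINE_PER_CPU(struct my_cpu_data *, my_cpu_data);

static void my_update_util(struct update_util_data *data, u64 time,
			   unsigned long util, unsigned long max)
{
	struct my_cpu_data *d = container_of(data, struct my_cpu_data,
					     update_util);

	/* Use d, time, util and max to evaluate the P-state here. */
}

static int my_start_cpu(int cpu)
{
	struct my_cpu_data *d = kzalloc(sizeof(*d), GFP_KERNEL);

	if (!d)
		return -ENOMEM;

	per_cpu(my_cpu_data, cpu) = d;
	d->update_util.func = my_update_util;
	cpufreq_set_update_util_data(cpu, &d->update_util);
	return 0;
}

static void my_stop_cpu(int cpu)
{
	cpufreq_set_update_util_data(cpu, NULL);
	/* Per the kerneldoc above: wait for in-flight callbacks first. */
	synchronize_rcu();
	kfree(per_cpu(my_cpu_data, cpu));
	per_cpu(my_cpu_data, cpu) = NULL;
}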

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH v10 1/3] cpufreq: Add mechanism for registering utilization update callbacks
  2016-02-15 21:47           ` [PATCH v10 " Rafael J. Wysocki
@ 2016-02-18 20:22             ` Rafael J. Wysocki
  2016-02-19  8:09               ` Juri Lelli
  2016-03-09 12:35             ` Peter Zijlstra
  1 sibling, 1 reply; 134+ messages in thread
From: Rafael J. Wysocki @ 2016-02-18 20:22 UTC (permalink / raw)
  To: Linux PM list
  Cc: Peter Zijlstra, Ingo Molnar, Linux Kernel Mailing List,
	Srinivas Pandruvada, Viresh Kumar, Juri Lelli, Steve Muckle,
	Thomas Gleixner, Rafael J. Wysocki

On Mon, Feb 15, 2016 at 10:47 PM, Rafael J. Wysocki <rjw@rjwysocki.net> wrote:
> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>
> Introduce a mechanism by which parts of the cpufreq subsystem
> ("setpolicy" drivers or the core) can register callbacks to be
> executed from cpufreq_update_util() which is invoked by the
> scheduler's update_load_avg() on CPU utilization changes.
>
> This allows the "setpolicy" drivers to dispense with their timers
> and do all of the computations they need and frequency/voltage
> adjustments in the update_load_avg() code path, among other things.
>
> The update_load_avg() changes were suggested by Peter Zijlstra.
>
> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
> ---
>
> Changes from v9:
> - Move the additional RT/DL hooks back to update_curr_rt/dl() (Peter says
>   that's OK), but only call them if updating the current CPU's rq, update
>   the cpufreq_trigger_update() kerneldoc.
>
> Changes from v8:
> - Peter thinks that cpufreq hooks in update_curr_rt/dl() are overkill so
>   move them to task_tick_rt/dl() and enqueue_task_rt/dl() (in case RT/DL
>   tasks are only active between ticks), update the cpufreq_trigger_update()
>   kerneldoc.
>
> Changes from v7:
> - cpufreq_trigger_update() has a kerneldoc describing it as a band-aid to
>   be replaced in the future and the comments next to its call sites ask
>   the reader to see that comment.
>
>   No functional changes.
>
> Changes from v6:
> - Steve suggested to use rq_clock() instead of rq_clock_task() as the time
>   argument for cpufreq_update_util() as that seems to be more suitable for
>   this purpose.
>
> ---
>  drivers/cpufreq/cpufreq.c |   45 +++++++++++++++++++++++++++++++++++++++++++++
>  include/linux/cpufreq.h   |   34 ++++++++++++++++++++++++++++++++++
>  kernel/sched/deadline.c   |    4 ++++
>  kernel/sched/fair.c       |   26 +++++++++++++++++++++++++-
>  kernel/sched/rt.c         |    4 ++++
>  kernel/sched/sched.h      |    1 +
>  6 files changed, 113 insertions(+), 1 deletion(-)

So if anyone has any issues with this one, please let me know.

It has been in linux-next for a few days and seems to be doing well.

As I said previously, there is a metric ton of cpufreq improvements
depending on it, so I'd rather not delay integrating it any more.

Thanks,
Rafael

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH v10 1/3] cpufreq: Add mechanism for registering utilization update callbacks
  2016-02-18 20:22             ` Rafael J. Wysocki
@ 2016-02-19  8:09               ` Juri Lelli
  2016-02-19 16:42                 ` Srinivas Pandruvada
  2016-02-19 22:14                 ` Rafael J. Wysocki
  0 siblings, 2 replies; 134+ messages in thread
From: Juri Lelli @ 2016-02-19  8:09 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Linux PM list, Peter Zijlstra, Ingo Molnar,
	Linux Kernel Mailing List, Srinivas Pandruvada, Viresh Kumar,
	Steve Muckle, Thomas Gleixner, Rafael J. Wysocki

Hi Rafael,

On 18/02/16 21:22, Rafael J. Wysocki wrote:
> On Mon, Feb 15, 2016 at 10:47 PM, Rafael J. Wysocki <rjw@rjwysocki.net> wrote:
> > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> >

[...]

> 
> So if anyone has any issues with this one, please let me know.
> 

I'm repeating myself a bit, but I'll try to articulate my only concern
once again anyway. I ran some tests on a couple of arm boxes and I
didn't notice any regressions or improvements for ondemand and
conservative (FWIW this might also work as a tested-by), so I tend to
take this series as a way to replace governor timers, making further
cleanups and fixes possible. I think you already confirmed this and I
understand why you'd like this series to go in, as I also think that
what we have on top is beneficial.

However, I still don't quite get why we want to introduce an interface
for explicit passing of util and max if we are not using such parameters
yet. Also, I couldn't find any indication of how such parameters will be
used in the future. If what we need today is a periodic kick for cpufreq
governors that need it, we should simply do as we already do for RT and
DL, IMHO. Also, the places where the current hooks reside might not be
the correct and useful ones once we start using the utilization
parameters. I could probably make a case for DL where we should place
hooks in the admission control path (or somewhere else once more
sophisticated mechanisms are in place) rather than in the periodic
tick.

> It has been in linux-next for a few days and seems to be doing well.
> 
> As I said previously, there is a metric ton of cpufreq improvements
> depending on it, so I'd rather not delay integrating it any more.
> 

As I said, I'm not against these changes, since they open the door to
further substantial fixes. I'm only wondering if we are doing the right
thing by defining an interface that nobody is using and without an
indication of how such a thing will be used in the future.

Best,

- Juri

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH v10 1/3] cpufreq: Add mechanism for registering utilization update callbacks
  2016-02-19  8:09               ` Juri Lelli
@ 2016-02-19 16:42                 ` Srinivas Pandruvada
  2016-02-19 17:26                   ` Juri Lelli
  2016-02-19 17:28                   ` Steve Muckle
  2016-02-19 22:14                 ` Rafael J. Wysocki
  1 sibling, 2 replies; 134+ messages in thread
From: Srinivas Pandruvada @ 2016-02-19 16:42 UTC (permalink / raw)
  To: Juri Lelli, Rafael J. Wysocki
  Cc: Linux PM list, Peter Zijlstra, Ingo Molnar,
	Linux Kernel Mailing List, Viresh Kumar, Steve Muckle,
	Thomas Gleixner, Rafael J. Wysocki

On Fri, 2016-02-19 at 08:09 +0000, Juri Lelli wrote:
Hi Juri,
> > 
> Hi Rafael,
> 
> On 18/02/16 21:22, Rafael J. Wysocki wrote:
> > On Mon, Feb 15, 2016 at 10:47 PM, Rafael J. Wysocki <rjw@rjwysocki.
> > net> wrote:
> > > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> > > 
> 
[...]

> However, I still don't quite get why we want to introduce an interface
> for explicit passing of util and max if we are not using such
> parameters yet. Also, I couldn't find any indication of how such
> parameters will be used in the future. If what we need today is a
> periodic kick for cpufreq governors that need it, we should simply do
> as we already do for RT and DL, IMHO. Also, the places where the
> current hooks reside might not be the correct and useful ones once we
> start using the utilization parameters. I could probably make a case
> for DL where we should place hooks in the admission control path (or
> somewhere else once more sophisticated mechanisms are in place) rather
> than in the periodic tick.
We did experiments using util/max in intel_pstate. For some benchmarks
there were regressions of 4 to 5%; for some benchmarks it performed on
par with getting utilization from the processor. Further optimization
of the algorithm is possible and still in progress. The idea is that we
can change the P-state fast enough to be more reactive. Once I have good
data, I will send it to this list. The algorithm can be part of the
cpufreq governor too.

Thanks,
Srinivas
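
For readers wondering what consuming util/max might look like in
practice, the snippet below is an illustrative, deliberately naive
linear mapping from the utilization ratio to a target frequency.  It is
a sketch under stated assumptions, not the intel_pstate algorithm
discussed above; freq_from_util() is an invented name, while struct
cpufreq_policy, its min/max fields and clamp_val() are existing kernel
facilities:

#include <linux/cpufreq.h>
#include <linux/kernel.h>

/* Illustrative only: scale the policy's maximum frequency by util/max. */
static unsigned int freq_from_util(struct cpufreq_policy *policy,
				   unsigned long util, unsigned long max)
{
	unsigned long target;

	/* cpufreq_trigger_update() passes util == ULONG_MAX, max == 0. */
	if (!max)
		return policy->max;

	target = (unsigned long)policy->max * util / max;
	return clamp_val(target, policy->min, policy->max);
}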

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH v10 1/3] cpufreq: Add mechanism for registering utilization update callbacks
  2016-02-19 16:42                 ` Srinivas Pandruvada
@ 2016-02-19 17:26                   ` Juri Lelli
  2016-02-19 22:26                     ` Rafael J. Wysocki
  2016-02-19 17:28                   ` Steve Muckle
  1 sibling, 1 reply; 134+ messages in thread
From: Juri Lelli @ 2016-02-19 17:26 UTC (permalink / raw)
  To: Srinivas Pandruvada
  Cc: Rafael J. Wysocki, Linux PM list, Peter Zijlstra, Ingo Molnar,
	Linux Kernel Mailing List, Viresh Kumar, Steve Muckle,
	Thomas Gleixner, Rafael J. Wysocki

Hi Srinivas,

On 19/02/16 08:42, Srinivas Pandruvada wrote:
> On Fri, 2016-02-19 at 08:09 +0000, Juri Lelli wrote:
> Hi Juri,
> > > 
> > Hi Rafael,
> > 
> > On 18/02/16 21:22, Rafael J. Wysocki wrote:
> > > On Mon, Feb 15, 2016 at 10:47 PM, Rafael J. Wysocki <rjw@rjwysocki.
> > > net> wrote:
> > > > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> > > > 
> > 
> [...]
> 
> > However, I still don't quite get why we want to introduce an
> > interface for explicit passing of util and max if we are not using
> > such parameters yet. Also, I couldn't find any indication of how
> > such parameters will be used in the future. If what we need today is
> > a periodic kick for cpufreq governors that need it, we should simply
> > do as we already do for RT and DL, IMHO. Also, the places where the
> > current hooks reside might not be the correct and useful ones once
> > we start using the utilization parameters. I could probably make a
> > case for DL where we should place hooks in the admission control
> > path (or somewhere else once more sophisticated mechanisms are in
> > place) rather than in the periodic tick.
> We did experiments using util/max in intel_pstate. For some benchmarks
> there were regressions of 4 to 5%; for some benchmarks it performed on
> par with getting utilization from the processor. Further optimization
> of the algorithm is possible and still in progress. The idea is that
> we can change the P-state fast enough to be more reactive. Once I have
> good data, I will send it to this list. The algorithm can be part of
> the cpufreq governor too.
> 

Thanks for your answer. What you are experimenting with looks really
interesting and I'm certainly more than interested in looking at your
findings and patches when they hit the list.

My point was more about what we can look at today, though. Without a
clear understanding of how and where util and max will be used, and
which scheduler paths such information should come from, it is a bit
difficult to tell if the current interface and hooks are fine, IMHO.
I'd suggest we leave this part to the discussion we will have once your
proposal is public; and to facilitate that, we should remove those
arguments from the current interface.

Best,

- Juri

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH v10 1/3] cpufreq: Add mechanism for registering utilization update callbacks
  2016-02-19 16:42                 ` Srinivas Pandruvada
  2016-02-19 17:26                   ` Juri Lelli
@ 2016-02-19 17:28                   ` Steve Muckle
  2016-02-19 22:35                     ` Rafael J. Wysocki
  2016-02-22 10:52                     ` Peter Zijlstra
  1 sibling, 2 replies; 134+ messages in thread
From: Steve Muckle @ 2016-02-19 17:28 UTC (permalink / raw)
  To: Srinivas Pandruvada, Juri Lelli, Rafael J. Wysocki
  Cc: Linux PM list, Peter Zijlstra, Ingo Molnar,
	Linux Kernel Mailing List, Viresh Kumar, Thomas Gleixner,
	Rafael J. Wysocki

On 02/19/2016 08:42 AM, Srinivas Pandruvada wrote:
> We did experiments using util/max in intel_pstate. For some benchmarks
> there were regressions of 4 to 5%; for some benchmarks it performed on
> par with getting utilization from the processor. Further optimization
> of the algorithm is possible and still in progress. The idea is that
> we can change the P-state fast enough to be more reactive. Once I have
> good data, I will send it to this list. The algorithm can be part of
> the cpufreq governor too.

There has been a lot of work in the area of scheduler-driven CPU
frequency selection by Linaro and ARM as well. It was posted most
recently a couple months ago:

http://thread.gmane.org/gmane.linux.power-management.general/69176

It was also posted as part of the energy-aware scheduling series last
July. There's a new RFC series forthcoming which I had hoped (and
failed) to post prior to my business travel this week; it should be out
next week. It will address the feedback received thus far along with
locking and other things.

The scheduler hooks for utilization-based cpufreq operation deserve a
lot more debate I think. They could quite possibly have different
requirements than hooks which are chosen just to guarantee periodic
callbacks into sampling-based governors.

For my part I think it would be best if the util/max parameters are
omitted until it's clear whether these same hooks can be effectively
used for architecture agnostic scheduler-guided (capacity driven) CPU
frequency support. My upcoming RFC will provide another opportunity to
debate the hooks as well as how scheduler-guided CPU frequency should be
structured.

thanks,
Steve

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH v10 1/3] cpufreq: Add mechanism for registering utilization update callbacks
  2016-02-19  8:09               ` Juri Lelli
  2016-02-19 16:42                 ` Srinivas Pandruvada
@ 2016-02-19 22:14                 ` Rafael J. Wysocki
  2016-02-22  9:32                   ` Juri Lelli
  1 sibling, 1 reply; 134+ messages in thread
From: Rafael J. Wysocki @ 2016-02-19 22:14 UTC (permalink / raw)
  To: Juri Lelli
  Cc: Rafael J. Wysocki, Linux PM list, Peter Zijlstra, Ingo Molnar,
	Linux Kernel Mailing List, Srinivas Pandruvada, Viresh Kumar,
	Steve Muckle, Thomas Gleixner

On Friday, February 19, 2016 08:09:17 AM Juri Lelli wrote:
> Hi Rafael,
> 
> On 18/02/16 21:22, Rafael J. Wysocki wrote:
> > On Mon, Feb 15, 2016 at 10:47 PM, Rafael J. Wysocki <rjw@rjwysocki.net> wrote:
> > > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> > >
> 
> [...]
> 
> > 
> > So if anyone has any issues with this one, please let me know.
> > 
> 
> I'm repeating myself a bit, but I'll try to articulate my only concern
> once again anyway. I ran some tests on a couple of arm boxes and I
> didn't notice any regressions or improvements for ondemand and
> conservative (FWIW this might also work as a tested-by), so I tend to
> take this series as a way to replace governor timers, making further
> cleanups and fixes possible. I think you already confirmed this and I
> understand why you'd like this series to go in, as I also think that
> what we have on top is beneficial.

OK

> However, I still don't quite get why we want to introduce an interface
> for explicit passing of util and max if we are not using such
> parameters yet. Also, I couldn't find any indication of how such
> parameters will be used in the future. If what we need today is a
> periodic kick for cpufreq governors that need it, we should simply do
> as we already do for RT and DL, IMHO. Also, the places where the
> current hooks reside might not be the correct and useful ones once we
> start using the utilization parameters. I could probably make a case
> for DL where we should place hooks in the admission control path (or
> somewhere else once more sophisticated mechanisms are in place) rather
> than in the periodic tick.

Well, the hook in DL is explicitly denoted as a temporary band-aid.

Srinivas and I have said multiple times that we are going to use the
scheduler's utilization data in intel_pstate.  Admittedly, we haven't shown
any patches implementing that, but that's because Srinivas doesn't regard
that work as ready yet.

I also have something for the general cpufreq in the works.  I may be able
to send it as an RFC over the weekend, depending on how much time I can
spend on it.

That said, if the concern is that there are plans to change the way the
scheduler computes the utilization numbers and that may become difficult to
carry out if cpufreq starts to depend on them in their current form, then I
may agree that it is valid, but I'm not aware of those plans ATM.

However, if the numbers are going to stay what they are, I don't see why
passing them to cpufreq may possibly become problematic at any point.

> > It has been in linux-next for a few days and seems to be doing well.
> > 
> > As I said previously, there is a metric ton of cpufreq improvements
> > depending on it, so I'd rather not delay integrating it any more.
> > 
> 
> As I said, I'm not against these changes, since they open the door to
> further substantial fixes.

Good. :-)

> I'm only wondering if we are doing the right thing by defining an
> interface that nobody is using and without an indication of how such a
> thing will be used in the future.

That indication may be coming though. :-)

Thanks,
Rafael

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH v10 1/3] cpufreq: Add mechanism for registering utilization update callbacks
  2016-02-19 17:26                   ` Juri Lelli
@ 2016-02-19 22:26                     ` Rafael J. Wysocki
  2016-02-22  9:42                       ` Juri Lelli
  2016-02-22 10:45                       ` Viresh Kumar
  0 siblings, 2 replies; 134+ messages in thread
From: Rafael J. Wysocki @ 2016-02-19 22:26 UTC (permalink / raw)
  To: Juri Lelli
  Cc: Srinivas Pandruvada, Rafael J. Wysocki, Linux PM list,
	Peter Zijlstra, Ingo Molnar, Linux Kernel Mailing List,
	Viresh Kumar, Steve Muckle, Thomas Gleixner

On Friday, February 19, 2016 05:26:04 PM Juri Lelli wrote:
> Hi Srinivas,
> 
> On 19/02/16 08:42, Srinivas Pandruvada wrote:
> > On Fri, 2016-02-19 at 08:09 +0000, Juri Lelli wrote:
> > Hi Juri,
> > > > 
> > > Hi Rafael,
> > > 
> > > On 18/02/16 21:22, Rafael J. Wysocki wrote:
> > > > On Mon, Feb 15, 2016 at 10:47 PM, Rafael J. Wysocki <rjw@rjwysocki.
> > > > net> wrote:
> > > > > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> > > > > 
> > > 
> > [...]
> > 
> > > However, I still don't quite get why we want to introduce an
> > > interface for explicit passing of util and max if we are not using
> > > such parameters yet. Also, I couldn't find any indication of how
> > > such parameters will be used in the future. If what we need today
> > > is a periodic kick for cpufreq governors that need it, we should
> > > simply do as we already do for RT and DL, IMHO. Also, the places
> > > where the current hooks reside might not be the correct and useful
> > > ones once we start using the utilization parameters. I could
> > > probably make a case for DL where we should place hooks in the
> > > admission control path (or somewhere else once more sophisticated
> > > mechanisms are in place) rather than in the periodic tick.
> > We did experiments using util/max in intel_pstate. For some
> > benchmarks there were regressions of 4 to 5%; for some benchmarks it
> > performed on par with getting utilization from the processor.
> > Further optimization of the algorithm is possible and still in
> > progress. The idea is that we can change the P-state fast enough to
> > be more reactive. Once I have good data, I will send it to this
> > list. The algorithm can be part of the cpufreq governor too.
> > 
> 
> Thanks for your answer. What you are experimenting with looks really
> interesting and I'm certainly more than interested in looking at your
> findings and patches when they hit the list.
> 
> My point was more about what we can look at today, though. Without a
> clear understanding of how and where util and max will be used, and
> which scheduler paths such information should come from, it is a bit
> difficult to tell if the current interface and hooks are fine, IMHO.

As I've just said, I may be able to show something shortly.

> I'd suggest we leave this part to the discussion we will have once your
> proposal is public; and to facilitate that, we should remove those
> arguments from the current interface.

I'm not really sure how this will help apart from removing some tiny extra
overhead that is expected to be temporary anyway.

That said, since both you and Steve are making the point that the utilization
arguments are problematic and I'd really like to be able to make progress here,
I don't have any fundamental problems with dropping them for the time being,
but I'm not going to rebase the 50+ commits I have queued up on top of the
$subject patch.

So I can apply something like the appended patch if that helps to address
your concerns.

Thanks,
Rafael


---
From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Subject: [PATCH] cpufreq: Rework the scheduler hooks for triggering updates

Commit fe7034338ba0 (cpufreq: Add mechanism for registering
utilization update callbacks) added cpufreq_update_util() to be
called by the scheduler (from the CFS part) on utilization updates.
The goal was to allow CFS to pass utilization information to cpufreq
and to trigger it to evaluate the frequency/voltage configuration
(P-state) of every CPU on a regular basis.

However, the last two arguments of that function are never used by
the current code, so CFS might simply call cpufreq_trigger_update()
instead of it.

For this reason, drop the last two arguments of cpufreq_update_util(),
rename it to cpufreq_trigger_update() and modify CFS to call it.

Moreover, since the utilization is not involved in that now, rename
data types, functions and variables related to cpufreq_trigger_update()
to reflect that (e.g. struct update_util_data becomes struct
freq_update_hook and so on).

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
---
 drivers/cpufreq/cpufreq.c          |   48 ++++++++++++++++++++++---------------
 drivers/cpufreq/cpufreq_governor.c |   27 ++++++++++----------
 drivers/cpufreq/cpufreq_governor.h |    2 -
 drivers/cpufreq/intel_pstate.c     |   15 +++++------
 include/linux/cpufreq.h            |   32 +++---------------------
 kernel/sched/deadline.c            |    2 -
 kernel/sched/fair.c                |   13 +---------
 kernel/sched/rt.c                  |    2 -
 8 files changed, 58 insertions(+), 83 deletions(-)

Index: linux-pm/drivers/cpufreq/cpufreq.c
===================================================================
--- linux-pm.orig/drivers/cpufreq/cpufreq.c
+++ linux-pm/drivers/cpufreq/cpufreq.c
@@ -103,46 +103,56 @@ static struct cpufreq_driver *cpufreq_dr
 static DEFINE_PER_CPU(struct cpufreq_policy *, cpufreq_cpu_data);
 static DEFINE_RWLOCK(cpufreq_driver_lock);
 
-static DEFINE_PER_CPU(struct update_util_data *, cpufreq_update_util_data);
+static DEFINE_PER_CPU(struct freq_update_hook *, cpufreq_freq_update_hook);
 
 /**
- * cpufreq_set_update_util_data - Populate the CPU's update_util_data pointer.
+ * cpufreq_set_freq_update_hook - Populate the CPU's freq_update_hook pointer.
  * @cpu: The CPU to set the pointer for.
  * @data: New pointer value.
  *
- * Set and publish the update_util_data pointer for the given CPU.  That pointer
- * points to a struct update_util_data object containing a callback function
- * to call from cpufreq_update_util().  That function will be called from an RCU
- * read-side critical section, so it must not sleep.
+ * Set and publish the freq_update_hook pointer for the given CPU.  That pointer
+ * points to a struct freq_update_hook object containing a callback function
+ * to call from cpufreq_trigger_update().  That function will be called from
+ * an RCU read-side critical section, so it must not sleep.
  *
  * Callers must use RCU callbacks to free any memory that might be accessed
  * via the old update_util_data pointer or invoke synchronize_rcu() right after
  * this function to avoid use-after-free.
  */
-void cpufreq_set_update_util_data(int cpu, struct update_util_data *data)
+void cpufreq_set_freq_update_hook(int cpu, struct freq_update_hook *hook)
 {
-	rcu_assign_pointer(per_cpu(cpufreq_update_util_data, cpu), data);
+	rcu_assign_pointer(per_cpu(cpufreq_freq_update_hook, cpu), hook);
 }
-EXPORT_SYMBOL_GPL(cpufreq_set_update_util_data);
+EXPORT_SYMBOL_GPL(cpufreq_set_freq_update_hook);
 
 /**
- * cpufreq_update_util - Take a note about CPU utilization changes.
+ * cpufreq_trigger_update - Trigger CPU performance state evaluation if needed.
  * @time: Current time.
- * @util: Current utilization.
- * @max: Utilization ceiling.
  *
- * This function is called by the scheduler on every invocation of
- * update_load_avg() on the CPU whose utilization is being updated.
+ * The way cpufreq is currently arranged requires it to evaluate the CPU
+ * performance state (frequency/voltage) on a regular basis.  To facilitate
+ * that, this function is called by update_load_avg() in CFS when executed for
+ * the current CPU's runqueue.
+ *
+ * However, this isn't sufficient to prevent the CPU from being stuck in a
+ * completely inadequate performance level for too long, because the calls
+ * from CFS will not be made if RT or deadline tasks are active all the time
+ * (or there are RT and DL tasks only).
+ *
+ * As a workaround for that issue, this function is called by the RT and DL
+ * sched classes to trigger extra cpufreq updates to prevent it from stalling,
+ * but that really is a band-aid.  Going forward it should be replaced with
+ * solutions targeted more specifically at RT and DL tasks.
  */
-void cpufreq_update_util(u64 time, unsigned long util, unsigned long max)
+void cpufreq_trigger_update(u64 time)
 {
-	struct update_util_data *data;
+	struct freq_update_hook *hook;
 
 	rcu_read_lock();
 
-	data = rcu_dereference(*this_cpu_ptr(&cpufreq_update_util_data));
-	if (data && data->func)
-		data->func(data, time, util, max);
+	hook = rcu_dereference(*this_cpu_ptr(&cpufreq_freq_update_hook));
+	if (hook && hook->func)
+		hook->func(hook, time);
 
 	rcu_read_unlock();
 }
Index: linux-pm/include/linux/cpufreq.h
===================================================================
--- linux-pm.orig/include/linux/cpufreq.h
+++ linux-pm/include/linux/cpufreq.h
@@ -147,35 +147,13 @@ static inline bool policy_is_shared(stru
 extern struct kobject *cpufreq_global_kobject;
 
 #ifdef CONFIG_CPU_FREQ
-void cpufreq_update_util(u64 time, unsigned long util, unsigned long max);
+void cpufreq_trigger_update(u64 time);
 
-/**
- * cpufreq_trigger_update - Trigger CPU performance state evaluation if needed.
- * @time: Current time.
- *
- * The way cpufreq is currently arranged requires it to evaluate the CPU
- * performance state (frequency/voltage) on a regular basis to prevent it from
- * being stuck in a completely inadequate performance level for too long.
- * That is not guaranteed to happen if the updates are only triggered from CFS,
- * though, because they may not be coming in if RT or deadline tasks are active
- * all the time (or there are RT and DL tasks only).
- *
- * As a workaround for that issue, this function is called by the RT and DL
- * sched classes to trigger extra cpufreq updates to prevent it from stalling,
- * but that really is a band-aid.  Going forward it should be replaced with
- * solutions targeted more specifically at RT and DL tasks.
- */
-static inline void cpufreq_trigger_update(u64 time)
-{
-	cpufreq_update_util(time, ULONG_MAX, 0);
-}
-
-struct update_util_data {
-	void (*func)(struct update_util_data *data,
-		     u64 time, unsigned long util, unsigned long max);
+struct freq_update_hook {
+	void (*func)(struct freq_update_hook *hook, u64 time);
 };
 
-void cpufreq_set_update_util_data(int cpu, struct update_util_data *data);
+void cpufreq_set_freq_update_hook(int cpu, struct freq_update_hook *hook);
 
 unsigned int cpufreq_get(unsigned int cpu);
 unsigned int cpufreq_quick_get(unsigned int cpu);
@@ -188,8 +166,6 @@ int cpufreq_update_policy(unsigned int c
 bool have_governor_per_policy(void);
 struct kobject *get_governor_parent_kobj(struct cpufreq_policy *policy);
 #else
-static inline void cpufreq_update_util(u64 time, unsigned long util,
-				       unsigned long max) {}
 static inline void cpufreq_trigger_update(u64 time) {}
 
 static inline unsigned int cpufreq_get(unsigned int cpu)
Index: linux-pm/drivers/cpufreq/cpufreq_governor.c
===================================================================
--- linux-pm.orig/drivers/cpufreq/cpufreq_governor.c
+++ linux-pm/drivers/cpufreq/cpufreq_governor.c
@@ -62,10 +62,10 @@ ssize_t store_sampling_rate(struct dbs_d
 		mutex_lock(&policy_dbs->timer_mutex);
 		/*
 		 * On 32-bit architectures this may race with the
-		 * sample_delay_ns read in dbs_update_util_handler(), but that
+		 * sample_delay_ns read in dbs_freq_update_handler(), but that
 		 * really doesn't matter.  If the read returns a value that's
 		 * too big, the sample will be skipped, but the next invocation
-		 * of dbs_update_util_handler() (when the update has been
+		 * of dbs_freq_update_handler() (when the update has been
 		 * completed) will take a sample.
 		 *
 		 * If this runs in parallel with dbs_work_handler(), we may end
@@ -261,7 +261,7 @@ unsigned int dbs_update(struct cpufreq_p
 }
 EXPORT_SYMBOL_GPL(dbs_update);
 
-void gov_set_update_util(struct policy_dbs_info *policy_dbs,
+void gov_set_freq_update_hooks(struct policy_dbs_info *policy_dbs,
 			 unsigned int delay_us)
 {
 	struct cpufreq_policy *policy = policy_dbs->policy;
@@ -273,17 +273,17 @@ void gov_set_update_util(struct policy_d
 	for_each_cpu(cpu, policy->cpus) {
 		struct cpu_dbs_info *cdbs = &per_cpu(cpu_dbs, cpu);
 
-		cpufreq_set_update_util_data(cpu, &cdbs->update_util);
+		cpufreq_set_freq_update_hook(cpu, &cdbs->update_hook);
 	}
 }
-EXPORT_SYMBOL_GPL(gov_set_update_util);
+EXPORT_SYMBOL_GPL(gov_set_freq_update_hooks);
 
-static inline void gov_clear_update_util(struct cpufreq_policy *policy)
+static inline void gov_clear_freq_update_hooks(struct cpufreq_policy *policy)
 {
 	int i;
 
 	for_each_cpu(i, policy->cpus)
-		cpufreq_set_update_util_data(i, NULL);
+		cpufreq_set_freq_update_hook(i, NULL);
 
 	synchronize_rcu();
 }
@@ -292,7 +292,7 @@ static void gov_cancel_work(struct cpufr
 {
 	struct policy_dbs_info *policy_dbs = policy->governor_data;
 
-	gov_clear_update_util(policy_dbs->policy);
+	gov_clear_freq_update_hooks(policy_dbs->policy);
 	irq_work_sync(&policy_dbs->irq_work);
 	cancel_work_sync(&policy_dbs->work);
 	atomic_set(&policy_dbs->work_count, 0);
@@ -336,10 +336,9 @@ static void dbs_irq_work(struct irq_work
 	schedule_work(&policy_dbs->work);
 }
 
-static void dbs_update_util_handler(struct update_util_data *data, u64 time,
-				    unsigned long util, unsigned long max)
+static void dbs_freq_update_handler(struct freq_update_hook *hook, u64 time)
 {
-	struct cpu_dbs_info *cdbs = container_of(data, struct cpu_dbs_info, update_util);
+	struct cpu_dbs_info *cdbs = container_of(hook, struct cpu_dbs_info, update_hook);
 	struct policy_dbs_info *policy_dbs = cdbs->policy_dbs;
 	u64 delta_ns;
 
@@ -397,7 +396,7 @@ static struct policy_dbs_info *alloc_pol
 		struct cpu_dbs_info *j_cdbs = &per_cpu(cpu_dbs, j);
 
 		j_cdbs->policy_dbs = policy_dbs;
-		j_cdbs->update_util.func = dbs_update_util_handler;
+		j_cdbs->update_hook.func = dbs_freq_update_handler;
 	}
 	return policy_dbs;
 }
@@ -414,7 +413,7 @@ static void free_policy_dbs_info(struct
 		struct cpu_dbs_info *j_cdbs = &per_cpu(cpu_dbs, j);
 
 		j_cdbs->policy_dbs = NULL;
-		j_cdbs->update_util.func = NULL;
+		j_cdbs->update_hook.func = NULL;
 	}
 	gov->free(policy_dbs);
 }
@@ -581,7 +580,7 @@ static int cpufreq_governor_start(struct
 
 	gov->start(policy);
 
-	gov_set_update_util(policy_dbs, sampling_rate);
+	gov_set_freq_update_hooks(policy_dbs, sampling_rate);
 	return 0;
 }
 
Index: linux-pm/drivers/cpufreq/cpufreq_governor.h
===================================================================
--- linux-pm.orig/drivers/cpufreq/cpufreq_governor.h
+++ linux-pm/drivers/cpufreq/cpufreq_governor.h
@@ -144,7 +144,7 @@ struct cpu_dbs_info {
 	 * wake-up from idle.
 	 */
 	unsigned int prev_load;
-	struct update_util_data update_util;
+	struct freq_update_hook update_hook;
 	struct policy_dbs_info *policy_dbs;
 };
 
Index: linux-pm/drivers/cpufreq/intel_pstate.c
===================================================================
--- linux-pm.orig/drivers/cpufreq/intel_pstate.c
+++ linux-pm/drivers/cpufreq/intel_pstate.c
@@ -103,7 +103,7 @@ struct _pid {
 struct cpudata {
 	int cpu;
 
-	struct update_util_data update_util;
+	struct freq_update_hook update_hook;
 
 	struct pstate_data pstate;
 	struct vid_data vid;
@@ -1013,10 +1013,9 @@ static inline void intel_pstate_adjust_b
 		sample->freq);
 }
 
-static void intel_pstate_update_util(struct update_util_data *data, u64 time,
-				     unsigned long util, unsigned long max)
+static void intel_pstate_freq_update(struct freq_update_hook *hook, u64 time)
 {
-	struct cpudata *cpu = container_of(data, struct cpudata, update_util);
+	struct cpudata *cpu = container_of(hook, struct cpudata, update_hook);
 	u64 delta_ns = time - cpu->sample.time;
 
 	if ((s64)delta_ns >= pid_params.sample_rate_ns) {
@@ -1082,8 +1081,8 @@ static int intel_pstate_init_cpu(unsigne
 	intel_pstate_busy_pid_reset(cpu);
 	intel_pstate_sample(cpu, 0);
 
-	cpu->update_util.func = intel_pstate_update_util;
-	cpufreq_set_update_util_data(cpunum, &cpu->update_util);
+	cpu->update_hook.func = intel_pstate_freq_update;
+	cpufreq_set_freq_update_hook(cpunum, &cpu->update_hook);
 
 	pr_debug("intel_pstate: controlling: cpu %d\n", cpunum);
 
@@ -1167,7 +1166,7 @@ static void intel_pstate_stop_cpu(struct
 
 	pr_debug("intel_pstate: CPU %d exiting\n", cpu_num);
 
-	cpufreq_set_update_util_data(cpu_num, NULL);
+	cpufreq_set_freq_update_hook(cpu_num, NULL);
 	synchronize_rcu();
 
 	if (hwp_active)
@@ -1425,7 +1424,7 @@ out:
 	get_online_cpus();
 	for_each_online_cpu(cpu) {
 		if (all_cpu_data[cpu]) {
-			cpufreq_set_update_util_data(cpu, NULL);
+			cpufreq_set_freq_update_hook(cpu, NULL);
 			synchronize_rcu();
 			kfree(all_cpu_data[cpu]);
 		}
Index: linux-pm/kernel/sched/fair.c
===================================================================
--- linux-pm.orig/kernel/sched/fair.c
+++ linux-pm/kernel/sched/fair.c
@@ -2839,8 +2839,6 @@ static inline void update_load_avg(struc
 		update_tg_load_avg(cfs_rq, 0);
 
 	if (cpu == smp_processor_id() && &rq->cfs == cfs_rq) {
-		unsigned long max = rq->cpu_capacity_orig;
-
 		/*
 		 * There are a few boundary cases this might miss but it should
 		 * get called often enough that that should (hopefully) not be
@@ -2849,16 +2847,9 @@ static inline void update_load_avg(struc
 		 * the next tick/schedule should update.
 		 *
 		 * It will not get called when we go idle, because the idle
-		 * thread is a different class (!fair), nor will the utilization
-		 * number include things like RT tasks.
-		 *
-		 * As is, the util number is not freq-invariant (we'd have to
-		 * implement arch_scale_freq_capacity() for that).
-		 *
-		 * See cpu_util().
+		 * thread is a different class (!fair).
 		 */
-		cpufreq_update_util(rq_clock(rq),
-				    min(cfs_rq->avg.util_avg, max), max);
+		cpufreq_trigger_update(rq_clock(rq));
 	}
 }
 
Index: linux-pm/kernel/sched/deadline.c
===================================================================
--- linux-pm.orig/kernel/sched/deadline.c
+++ linux-pm/kernel/sched/deadline.c
@@ -726,7 +726,7 @@ static void update_curr_dl(struct rq *rq
 	if (!dl_task(curr) || !on_dl_rq(dl_se))
 		return;
 
-	/* Kick cpufreq (see the comment in linux/cpufreq.h). */
+	/* Kick cpufreq (see the comment in drivers/cpufreq/cpufreq.c). */
 	if (cpu_of(rq) == smp_processor_id())
 		cpufreq_trigger_update(rq_clock(rq));
 
Index: linux-pm/kernel/sched/rt.c
===================================================================
--- linux-pm.orig/kernel/sched/rt.c
+++ linux-pm/kernel/sched/rt.c
@@ -945,7 +945,7 @@ static void update_curr_rt(struct rq *rq
 	if (curr->sched_class != &rt_sched_class)
 		return;
 
-	/* Kick cpufreq (see the comment in linux/cpufreq.h). */
+	/* Kick cpufreq (see the comment in drivers/cpufreq/cpufreq.c). */
 	if (cpu_of(rq) == smp_processor_id())
 		cpufreq_trigger_update(rq_clock(rq));
 

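As a rough sketch of the driver side after this rework (hypothetical
names again, with struct my_cpu_data now embedding a struct
freq_update_hook member called update_hook; only
cpufreq_set_freq_update_hook() and the freq_update_hook callback
signature come from the patch), a user would now do:

/* The callback gets the time only; util/max are gone. */
static void my_freq_update(struct freq_update_hook *hook, u64 time)
{
	struct my_cpu_data *d = container_of(hook, struct my_cpu_data,
					     update_hook);

	/* Decide, based on 'time' alone, whether a sample is due. */
}

static void my_start_cpu(int cpu, struct my_cpu_data *d)
{
	d->update_hook.func = my_freq_update;
	cpufreq_set_freq_update_hook(cpu, &d->update_hook);
}

static void my_stop_cpu(int cpu, struct my_cpu_data *d)
{
	cpufreq_set_freq_update_hook(cpu, NULL);
	synchronize_rcu();	/* wait for in-flight callbacks before freeing @d */
}
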
^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH v10 1/3] cpufreq: Add mechanism for registering utilization update callbacks
  2016-02-19 17:28                   ` Steve Muckle
@ 2016-02-19 22:35                     ` Rafael J. Wysocki
  2016-02-23  3:58                       ` Steve Muckle
  2016-02-22 10:52                     ` Peter Zijlstra
  1 sibling, 1 reply; 134+ messages in thread
From: Rafael J. Wysocki @ 2016-02-19 22:35 UTC (permalink / raw)
  To: Steve Muckle
  Cc: Srinivas Pandruvada, Juri Lelli, Rafael J. Wysocki, Linux PM list,
	Peter Zijlstra, Ingo Molnar, Linux Kernel Mailing List,
	Viresh Kumar, Thomas Gleixner

On Friday, February 19, 2016 09:28:23 AM Steve Muckle wrote:
> On 02/19/2016 08:42 AM, Srinivas Pandruvada wrote:
> > We did experiments using util/max in intel_pstate. For some
> > benchmarks there were regressions of 4 to 5%; for some benchmarks it
> > performed on par with getting utilization from the processor.
> > Further optimization of the algorithm is possible and still in
> > progress. The idea is that we can change the P-state fast enough to
> > be more reactive. Once I have good data, I will send it to this
> > list. The algorithm can be part of the cpufreq governor too.
> 
> There has been a lot of work in the area of scheduler-driven CPU
> frequency selection by Linaro and ARM as well. It was posted most
> recently a couple months ago:
> 
> http://thread.gmane.org/gmane.linux.power-management.general/69176
> 
> It was also posted as part of the energy-aware scheduling series last
> July. There's a new RFC series forthcoming which I had hoped (and
> failed) to post prior to my business travel this week; it should be out
> next week. It will address the feedback received thus far along with
> locking and other things.
> 
> The scheduler hooks for utilization-based cpufreq operation deserve a
> lot more debate I think. They could quite possibly have different
> requirements than hooks which are chosen just to guarantee periodic
> callbacks into sampling-based governors.

Yes, they could.

The point here, though, is that even the sampling-based governors may 
benefit from using the numbers provided by the scheduler instead of trying
to come up with analogous numbers themselves.

> For my part I think it would be best if the util/max parameters are
> omitted

OK, so please see the patch I've just sent to Juri:

https://patchwork.kernel.org/patch/8364621/

> until it's clear whether these same hooks can be effectively
> used for architecture agnostic scheduler-guided (capacity driven) CPU
> frequency support.

Well, if they can't, then we'll need to move the hooks, but I'm not sure
how this is related to the arguments they take.

> My upcoming RFC will provide another opportunity to debate the hooks as
> well as how scheduler-guided CPU frequency should be structured.

OK, looking forward to seeing the RFC then. :-)

Thanks,
Rafael

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH v10 1/3] cpufreq: Add mechanism for registering utilization update callbacks
  2016-02-19 22:14                 ` Rafael J. Wysocki
@ 2016-02-22  9:32                   ` Juri Lelli
  2016-02-22 21:26                     ` Rafael J. Wysocki
  0 siblings, 1 reply; 134+ messages in thread
From: Juri Lelli @ 2016-02-22  9:32 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Rafael J. Wysocki, Linux PM list, Peter Zijlstra, Ingo Molnar,
	Linux Kernel Mailing List, Srinivas Pandruvada, Viresh Kumar,
	Steve Muckle, Thomas Gleixner

On 19/02/16 23:14, Rafael J. Wysocki wrote:
> On Friday, February 19, 2016 08:09:17 AM Juri Lelli wrote:
> > Hi Rafael,
> > 
> > On 18/02/16 21:22, Rafael J. Wysocki wrote:
> > > On Mon, Feb 15, 2016 at 10:47 PM, Rafael J. Wysocki <rjw@rjwysocki.net> wrote:
> > > > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> > > >
> > 
> > [...]
> > 
> > > 
> > > So if anyone has any issues with this one, please let me know.
> > > 
> > 
> > I'm repeating myself a bit, but I'll try to articulate my only concern
> > once again anyway. I ran some tests on a couple of arm boxes and I
> > didn't notice any regressions or improvements for ondemand and
> > conservative (FWIW this might also work as a tested-by), so I tend to
> > take this series as a way to replace governor timers, making further
> > cleanups and fixes possible. I think you already confirmed this and I
> > understand why you'd like this series to go in, as I also think that
> > what we have on top is beneficial.
> 
> OK
> 
> > However, I still don't quite get why we want to introduce an
> > interface for explicit passing of util and max if we are not using
> > such parameters yet. Also, I couldn't find any indication of how
> > such parameters will be used in the future. If what we need today is
> > a periodic kick for cpufreq governors that need it, we should simply
> > do as we already do for RT and DL, IMHO. Also, the places where the
> > current hooks reside might not be the correct and useful ones once
> > we start using the utilization parameters. I could probably make a
> > case for DL where we should place hooks in the admission control
> > path (or somewhere else once more sophisticated mechanisms are in
> > place) rather than in the periodic tick.
> 
> Well, the hook in DL is explicitly denoted as a temporary band-aid.
> 
> Srinivas and I have said multiple times that we are going to use the
> scheduler's utilization data in intel_pstate.  Admittedly, we haven't shown
> any patches implementing that, but that's because Srinivas doesn't regard
> that work as ready yet.
> 
> I also have something for the general cpufreq in the works.  I may be able
> to send it as an RFC over the weekend, depending on how much time I can
> spend on it.
> 

Saw that, thanks. Please allow me some time to review and test. :-)

> That said, if the concern is that there are plans to change the way the
> scheduler computes the utilization numbers and that may become difficult to
> carry out if cpufreq starts to depend on them in their current form, then I
> may agree that it is valid, but I'm not aware of those plans ATM.
> 

No, I don't think there's any substantial discussion going on about the
utilization numbers.

> However, if the numbers are going to stay what they are, I don't see why
> passing them to cpufreq may possibly become problematic at any point.

My concern was mostly about the fact that there is already another RFC
under discussion that uses the same numbers and has different hooks
placed in scheduler code (Steve's sched-freq); so additional hooks
might generate confusion, IMHO.
 
> > > It has been in linux-next for a few days and seems to be doing well.
> > > 
> > > As I said previously, there is a metric ton of cpufreq improvements
> > > depending on it, so I'd rather not delay integrating it any more.
> > > 
> > 
> > As I said, I'm not against these changes, since they open the door to
> > further substantial fixes.
> 
> Good. :-)
> 
> > I'm only wondering if we are doing the right thing by defining an
> > interface that nobody is using and without an indication of how such a
> > thing will be used in the future.
> 
> That indication may be coming though. :-)
> 

Thanks again. I'm going to have a look at that.

Best,

- Juri

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH v10 1/3] cpufreq: Add mechanism for registering utilization update callbacks
  2016-02-19 22:26                     ` Rafael J. Wysocki
@ 2016-02-22  9:42                       ` Juri Lelli
  2016-02-22 21:41                         ` Rafael J. Wysocki
  2016-02-22 10:45                       ` Viresh Kumar
  1 sibling, 1 reply; 134+ messages in thread
From: Juri Lelli @ 2016-02-22  9:42 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Srinivas Pandruvada, Rafael J. Wysocki, Linux PM list,
	Peter Zijlstra, Ingo Molnar, Linux Kernel Mailing List,
	Viresh Kumar, Steve Muckle, Thomas Gleixner

Hi Rafael,

On 19/02/16 23:26, Rafael J. Wysocki wrote:
> On Friday, February 19, 2016 05:26:04 PM Juri Lelli wrote:
> > Hi Srinivas,
> > 
> > On 19/02/16 08:42, Srinivas Pandruvada wrote:
> > > On Fri, 2016-02-19 at 08:09 +0000, Juri Lelli wrote:
> > > Hi Juri,
> > > > > 
> > > > Hi Rafael,
> > > > 
> > > > On 18/02/16 21:22, Rafael J. Wysocki wrote:
> > > > > On Mon, Feb 15, 2016 at 10:47 PM, Rafael J. Wysocki <rjw@rjwysocki.
> > > > > net> wrote:
> > > > > > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> > > > > > 
> > > > 
> > > [...]
> > > 
> > > > However, I still don't quite get why we want to introduce an
> > > > interface for explicit passing of util and max if we are not
> > > > using such parameters yet. Also, I couldn't find any indication
> > > > of how such parameters will be used in the future. If what we
> > > > need today is a periodic kick for cpufreq governors that need
> > > > it, we should simply do as we already do for RT and DL, IMHO.
> > > > Also, the places where the current hooks reside might not be the
> > > > correct and useful ones once we start using the utilization
> > > > parameters. I could probably make a case for DL where we should
> > > > place hooks in the admission control path (or somewhere else
> > > > once more sophisticated mechanisms are in place) rather than in
> > > > the periodic tick.
> > > We did experiments using util/max in intel_pstate. For some
> > > benchmarks there were regressions of 4 to 5%; for some benchmarks
> > > it performed on par with getting utilization from the processor.
> > > Further optimization of the algorithm is possible and still in
> > > progress. The idea is that we can change the P-state fast enough
> > > to be more reactive. Once I have good data, I will send it to this
> > > list. The algorithm can be part of the cpufreq governor too.
> > > 
> > 
> > Thanks for your answer. What you are experimenting with looks really
> > interesting and I'm certainly more than interested in looking at your
> > findings and patches when they hit the list.
> > 
> > My point was more about what we can look at today, though. Without a
> > clear understanding of how and where util and max will be used, and
> > which scheduler paths such information should come from, it is a bit
> > difficult to tell if the current interface and hooks are fine, IMHO.
> 
> As I've just said, I may be able to show something shortly.
> 
> > I'd suggest we leave this part to the discussion we will have once your
> > proposal is public; and to facilitate that, we should remove those
> > arguments from the current interface.
> 
> I'm not really sure how this will help apart from removing some tiny extra
> overhead that is expected to be temporary anyway.
> 
> That said, since both you and Steve are making the point that the utilization
> arguments are problematic and I'd really like to be able to make progress here,
> I don't have any fundamental problems with dropping them for the time being,
> but I'm not going to rebase the 50+ commits I have queued up on top of the
> $subject patch.
> 
> So I can apply something like the appended patch if that helps to address
> your concerns.
> 
> Thanks,
> Rafael
> 
> 
> ---
> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> Subject: [PATCH] cpufreq: Rework the scheduler hooks for triggering updates
> 
> Commit fe7034338ba0 (cpufreq: Add mechanism for registering
> utilization update callbacks) added cpufreq_update_util() to be
> called by the scheduler (from the CFS part) on utilization updates.
> The goal was to allow CFS to pass utilization information to cpufreq
> and to trigger it to evaluate the frequency/voltage configuration
> (P-state) of every CPU on a regular basis.
> 
> However, the last two arguments of that function are never used by
> the current code, so CFS might simply call cpufreq_trigger_update()
> instead of it.
> 
> For this reason, drop the last two arguments of cpufreq_update_util(),
> rename it to cpufreq_trigger_update() and modify CFS to call it.
> 
> Moreover, since the utilization is not involved in that now, rename
> data types, functions and variables related to cpufreq_trigger_update()
> to reflect that (e.g. struct update_util_data becomes struct
> freq_update_hook and so on).
> 
> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

This patch looks good to me. I haven't tested it yet, but it shouldn't
break things AFAICT.

Thanks a lot for taking the time for this cleanup.

Best,

- Juri

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH v10 1/3] cpufreq: Add mechanism for registering utilization update callbacks
  2016-02-19 22:26                     ` Rafael J. Wysocki
  2016-02-22  9:42                       ` Juri Lelli
@ 2016-02-22 10:45                       ` Viresh Kumar
  1 sibling, 0 replies; 134+ messages in thread
From: Viresh Kumar @ 2016-02-22 10:45 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Juri Lelli, Srinivas Pandruvada, Rafael J. Wysocki, Linux PM list,
	Peter Zijlstra, Ingo Molnar, Linux Kernel Mailing List,
	Steve Muckle, Thomas Gleixner

On 19-02-16, 23:26, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> Subject: [PATCH] cpufreq: Rework the scheduler hooks for triggering updates
> 
> Commit fe7034338ba0 (cpufreq: Add mechanism for registering
> utilization update callbacks) added cpufreq_update_util() to be
> called by the scheduler (from the CFS part) on utilization updates.
> The goal was to allow CFS to pass utilization information to cpufreq
> and to trigger it to evaluate the frequency/voltage configuration
> (P-state) of every CPU on a regular basis.
> 
> However, the last two arguments of that function are never used by
> the current code, so CFS might simply call cpufreq_trigger_update()
> instead of it.
> 
> For this reason, drop the last two arguments of cpufreq_update_util(),
> rename it to cpufreq_trigger_update() and modify CFS to call it.
> 
> Moreover, since the utilization is not involved in that now, rename
> data types, functions and variables related to cpufreq_trigger_update()
> to reflect that (eg. struct update_util_data becomes struct
> freq_update_hook and so on).
> 
> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> ---
>  drivers/cpufreq/cpufreq.c          |   48 ++++++++++++++++++++++---------------
>  drivers/cpufreq/cpufreq_governor.c |   27 ++++++++++----------
>  drivers/cpufreq/cpufreq_governor.h |    2 -
>  drivers/cpufreq/intel_pstate.c     |   15 +++++------
>  include/linux/cpufreq.h            |   32 +++---------------------
>  kernel/sched/deadline.c            |    2 -
>  kernel/sched/fair.c                |   13 +---------
>  kernel/sched/rt.c                  |    2 -
>  8 files changed, 58 insertions(+), 83 deletions(-)

Acked-by: Viresh Kumar <viresh.kumar@linaro.org>

-- 
viresh

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH v10 1/3] cpufreq: Add mechanism for registering utilization update callbacks
  2016-02-19 17:28                   ` Steve Muckle
  2016-02-19 22:35                     ` Rafael J. Wysocki
@ 2016-02-22 10:52                     ` Peter Zijlstra
  2016-02-22 14:33                       ` Vincent Guittot
                                         ` (2 more replies)
  1 sibling, 3 replies; 134+ messages in thread
From: Peter Zijlstra @ 2016-02-22 10:52 UTC (permalink / raw)
  To: Steve Muckle
  Cc: Srinivas Pandruvada, Juri Lelli, Rafael J. Wysocki, Linux PM list,
	Ingo Molnar, Linux Kernel Mailing List, Viresh Kumar,
	Thomas Gleixner, Rafael J. Wysocki

On Fri, Feb 19, 2016 at 09:28:23AM -0800, Steve Muckle wrote:
> On 02/19/2016 08:42 AM, Srinivas Pandruvada wrote:
> > We did experiments using util/max in intel_pstate. For some benchmarks
> > there were regressions of 4 to 5%; for some benchmarks it performed on
> > par with getting utilization from the processor. Further optimization
> > of the algorithm is possible and still in progress. The idea is that
> > we can change the P-state fast enough and be more reactive. Once I
> > have good data, I will send it to this list. The algorithm can be
> > part of the cpufreq governor too.
> 
> There has been a lot of work in the area of scheduler-driven CPU
> frequency selection by Linaro and ARM as well. It was posted most
> recently a couple months ago:
> 
> http://thread.gmane.org/gmane.linux.power-management.general/69176
> 
> It was also posted as part of the energy-aware scheduling series last
> July. There's a new RFC series forthcoming which I had hoped (and
> failed) to post prior to my business travel this week; it should be out
> next week. It will address the feedback received thus far along with
> locking and other things.

Right, so I had a wee look at that again, and had a quick chat with Juri
on IRC. So the main difference seems to be that you guys want to know
why the utilization changed, as opposed to purely _that_ it changed.

And hence you have callbacks all over the place.

I'm not too sure I really like that too much; it bloats the code and
somewhat obfuscates the point.

So I would really like there to be just the one callback when we
actually compute a new number, and that is update_load_avg().

Now I think we can 'easily' propagate the information you want into
update_load_avg() (see below), but I would like to see actual arguments
for why you would need this.

For one, the migration bits don't really make sense. We typically do not
call migration code locally on both cpus, typically just one, but possibly
neither. That means you cannot actually update the relevant CPU state
from these sites anyway.

> The scheduler hooks for utilization-based cpufreq operation deserve a
> lot more debate I think. They could quite possibly have different
> requirements than hooks which are chosen just to guarantee periodic
> callbacks into sampling-based governors.

I'll repeat what Rafael said: the periodic callback nature is a
'temporary' hack, simply because current cpufreq depends on that.

The idea is to wean cpufreq off of that requirement and then drop that
part.
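
One note in case the enum arithmetic in the patch below looks odd: the
enqueue/dequeue sites pass LOAD_ENQUEUE + (p->on_rq & TASK_ON_RQ_MIGRATING),
and since TASK_ON_RQ_MIGRATING is 2 (kernel/sched/sched.h) this lands
exactly on the LOAD_*_MOVE = LOAD_* + 2 entries:

	/*
	 * p->on_rq is TASK_ON_RQ_MIGRATING (2) while the task is being
	 * moved between runqueues, and 0 or TASK_ON_RQ_QUEUED (1)
	 * otherwise, so the mask adds 2 -- giving LOAD_ENQUEUE_MOVE /
	 * LOAD_DEQUEUE_MOVE -- exactly in the migration case.
	 */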

Very-much-not-signed-off-by: Peter Zijlstra
---
 kernel/sched/fair.c | 29 +++++++++++++++++++++--------
 1 file changed, 21 insertions(+), 8 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7ce24a456322..f3e95d8b65c3 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2528,6 +2528,17 @@ static inline void update_cfs_shares(struct cfs_rq *cfs_rq)
 }
 #endif /* CONFIG_FAIR_GROUP_SCHED */
 
+enum load_update_type {
+	LOAD_NONE,
+	LOAD_TICK,
+	LOAD_PUT,
+	LOAD_SET,
+	LOAD_ENQUEUE,
+	LOAD_DEQUEUE,
+	LOAD_ENQUEUE_MOVE = LOAD_ENQUEUE + 2,
+	LOAD_DEQUEUE_MOVE = LOAD_DEQUEUE + 2,
+};
+
 #ifdef CONFIG_SMP
 /* Precomputed fixed inverse multiplies for multiplication by y^n */
 static const u32 runnable_avg_yN_inv[] = {
@@ -2852,7 +2863,8 @@ static inline int update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
 }
 
 /* Update task and its cfs_rq load average */
-static inline void update_load_avg(struct sched_entity *se, int update_tg)
+static inline void update_load_avg(struct sched_entity *se, int update_tg,
+				   enum load_update_type type)
 {
 	struct cfs_rq *cfs_rq = cfs_rq_of(se);
 	u64 now = cfs_rq_clock_task(cfs_rq);
@@ -2940,7 +2952,7 @@ enqueue_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
 static inline void
 dequeue_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
-	update_load_avg(se, 1);
+	update_load_avg(se, 1, LOAD_DEQUEUE);
 
 	cfs_rq->runnable_load_avg =
 		max_t(long, cfs_rq->runnable_load_avg - se->avg.load_avg, 0);
@@ -3006,7 +3018,8 @@ static int idle_balance(struct rq *this_rq);
 
 #else /* CONFIG_SMP */
 
-static inline void update_load_avg(struct sched_entity *se, int update_tg) {}
+static inline void update_load_avg(struct sched_entity *se, int update_tg,
+				   enum load_update_type type) {}
 static inline void
 enqueue_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se) {}
 static inline void
@@ -3327,7 +3340,7 @@ set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
 		if (schedstat_enabled())
 			update_stats_wait_end(cfs_rq, se);
 		__dequeue_entity(cfs_rq, se);
-		update_load_avg(se, 1);
+		update_load_avg(se, 1, LOAD_SET);
 	}
 
 	update_stats_curr_start(cfs_rq, se);
@@ -3431,7 +3444,7 @@ static void put_prev_entity(struct cfs_rq *cfs_rq, struct sched_entity *prev)
 		/* Put 'current' back into the tree. */
 		__enqueue_entity(cfs_rq, prev);
 		/* in !on_rq case, update occurred at dequeue */
-		update_load_avg(prev, 0);
+		update_load_avg(prev, 0, LOAD_PUT);
 	}
 	cfs_rq->curr = NULL;
 }
@@ -3447,7 +3460,7 @@ entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued)
 	/*
 	 * Ensure that runnable average is periodically updated.
 	 */
-	update_load_avg(curr, 1);
+	update_load_avg(curr, 1, LOAD_TICK);
 	update_cfs_shares(cfs_rq);
 
 #ifdef CONFIG_SCHED_HRTICK
@@ -4320,7 +4333,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 		if (cfs_rq_throttled(cfs_rq))
 			break;
 
-		update_load_avg(se, 1);
+		update_load_avg(se, 1, LOAD_ENQUEUE + (p->on_rq & TASK_ON_RQ_MIGRATING));
 		update_cfs_shares(cfs_rq);
 	}
 
@@ -4380,7 +4393,7 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 		if (cfs_rq_throttled(cfs_rq))
 			break;
 
-		update_load_avg(se, 1);
+		update_load_avg(se, 1, LOAD_DEQUEUE + (p->on_rq & TASK_ON_RQ_MIGRATING));
 		update_cfs_shares(cfs_rq);
 	}
 

^ permalink raw reply related	[flat|nested] 134+ messages in thread

* Re: [PATCH v10 1/3] cpufreq: Add mechanism for registering utilization update callbacks
  2016-02-22 10:52                     ` Peter Zijlstra
@ 2016-02-22 14:33                       ` Vincent Guittot
  2016-02-22 15:31                         ` Peter Zijlstra
  2016-02-22 14:40                       ` Juri Lelli
  2016-02-22 21:46                       ` Rafael J. Wysocki
  2 siblings, 1 reply; 134+ messages in thread
From: Vincent Guittot @ 2016-02-22 14:33 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Steve Muckle, Srinivas Pandruvada, Juri Lelli, Rafael J. Wysocki,
	Linux PM list, Ingo Molnar, Linux Kernel Mailing List,
	Viresh Kumar, Thomas Gleixner, Rafael J. Wysocki

On 22 February 2016 at 11:52, Peter Zijlstra <peterz@infradead.org> wrote:
> On Fri, Feb 19, 2016 at 09:28:23AM -0800, Steve Muckle wrote:
>> On 02/19/2016 08:42 AM, Srinivas Pandruvada wrote:
>> > We did experiments using util/max in intel_pstate. For some benchmarks
>> > there were regressions of 4 to 5%; for some benchmarks it performed on
>> > par with getting utilization from the processor. Further optimization
>> > of the algorithm is possible and still in progress. The idea is that
>> > we can change the P-state fast enough and be more reactive. Once I
>> > have good data, I will send it to this list. The algorithm can be
>> > part of the cpufreq governor too.
>>
>> There has been a lot of work in the area of scheduler-driven CPU
>> frequency selection by Linaro and ARM as well. It was posted most
>> recently a couple months ago:
>>
>> http://thread.gmane.org/gmane.linux.power-management.general/69176
>>
>> It was also posted as part of the energy-aware scheduling series last
>> July. There's a new RFC series forthcoming which I had hoped (and
>> failed) to post prior to my business travel this week; it should be out
>> next week. It will address the feedback received thus far along with
>> locking and other things.
>
> Right, so I had a wee look at that again, and had a quick chat with Juri
> on IRC. So the main difference seems to be that you guys want to know
> why the utilization changed, as opposed to purely _that_ it changed.

Yes, the main goal was to be able to filter the useful and useless
updates of the rq's utilization in order to minimize/optimize the
triggering of frequency updates. These patches were made for a cpufreq
driver that reacts far more slowly than the scheduler. It might be
worth starting with a simple solution and updating it afterwards.

>
> And hence you have callbacks all over the place.
>
> I'm not too sure I really like that too much; it bloats the code and
> somewhat obfuscates the point.
>
> So I would really like there to be just the one callback when we
> actually compute a new number, and that is update_load_avg().
>
> Now I think we can 'easily' propagate the information you want into
> update_load_avg() (see below), but I would like to see actual arguments
> for why you would need this.

Your proposal is interesting, except that we are interested in the
rq's utilization more than the se's, so we should rather use
update_cfs_rq_load_avg and a few additional places like
attach_entity_load_avg, which bypasses update_cfs_rq_load_avg to
update the rq's utilization and load.

>
> For one, the migration bits don't really make sense. We typically do not
> call migration code locally on both cpus, typically just one, but possibly
> neither. That means you cannot actually update the relevant CPU state
> from these sites anyway.
>
>> The scheduler hooks for utilization-based cpufreq operation deserve a
>> lot more debate I think. They could quite possibly have different
>> requirements than hooks which are chosen just to guarantee periodic
>> callbacks into sampling-based governors.
>
> I'll repeat what Rafael said: the periodic callback nature is a
> 'temporary' hack, simply because current cpufreq depends on that.
>
> The idea is to wean cpufreq off of that requirement and then drop that
> part.
>
> Very-much-not-signed-off-by: Peter Zijlstra
> ---
>  kernel/sched/fair.c | 29 +++++++++++++++++++++--------
>  1 file changed, 21 insertions(+), 8 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 7ce24a456322..f3e95d8b65c3 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -2528,6 +2528,17 @@ static inline void update_cfs_shares(struct cfs_rq *cfs_rq)
>  }
>  #endif /* CONFIG_FAIR_GROUP_SCHED */
>
> +enum load_update_type {
> +       LOAD_NONE,
> +       LOAD_TICK,
> +       LOAD_PUT,
> +       LOAD_SET,
> +       LOAD_ENQUEUE,
> +       LOAD_DEQUEUE,
> +       LOAD_ENQUEUE_MOVE = LOAD_ENQUEUE + 2,
> +       LOAD_DEQUEUE_MOVE = LOAD_DEQUEUE + 2,
> +};
> +
>  #ifdef CONFIG_SMP
>  /* Precomputed fixed inverse multiplies for multiplication by y^n */
>  static const u32 runnable_avg_yN_inv[] = {
> @@ -2852,7 +2863,8 @@ static inline int update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
>  }
>
>  /* Update task and its cfs_rq load average */
> -static inline void update_load_avg(struct sched_entity *se, int update_tg)
> +static inline void update_load_avg(struct sched_entity *se, int update_tg,
> +                                  enum load_update_type type)
>  {
>         struct cfs_rq *cfs_rq = cfs_rq_of(se);
>         u64 now = cfs_rq_clock_task(cfs_rq);
> @@ -2940,7 +2952,7 @@ enqueue_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
>  static inline void
>  dequeue_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
>  {
> -       update_load_avg(se, 1);
> +       update_load_avg(se, 1, LOAD_DEQUEUE);
>
>         cfs_rq->runnable_load_avg =
>                 max_t(long, cfs_rq->runnable_load_avg - se->avg.load_avg, 0);
> @@ -3006,7 +3018,8 @@ static int idle_balance(struct rq *this_rq);
>
>  #else /* CONFIG_SMP */
>
> -static inline void update_load_avg(struct sched_entity *se, int update_tg) {}
> +static inline void update_load_avg(struct sched_entity *se, int update_tg,
> +                                  enum load_update_type type) {}
>  static inline void
>  enqueue_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se) {}
>  static inline void
> @@ -3327,7 +3340,7 @@ set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
>                 if (schedstat_enabled())
>                         update_stats_wait_end(cfs_rq, se);
>                 __dequeue_entity(cfs_rq, se);
> -               update_load_avg(se, 1);
> +               update_load_avg(se, 1, LOAD_SET);
>         }
>
>         update_stats_curr_start(cfs_rq, se);
> @@ -3431,7 +3444,7 @@ static void put_prev_entity(struct cfs_rq *cfs_rq, struct sched_entity *prev)
>                 /* Put 'current' back into the tree. */
>                 __enqueue_entity(cfs_rq, prev);
>                 /* in !on_rq case, update occurred at dequeue */
> -               update_load_avg(prev, 0);
> +               update_load_avg(prev, 0, LOAD_PUT);
>         }
>         cfs_rq->curr = NULL;
>  }
> @@ -3447,7 +3460,7 @@ entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued)
>         /*
>          * Ensure that runnable average is periodically updated.
>          */
> -       update_load_avg(curr, 1);
> +       update_load_avg(curr, 1, LOAD_TICK);
>         update_cfs_shares(cfs_rq);
>
>  #ifdef CONFIG_SCHED_HRTICK
> @@ -4320,7 +4333,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>                 if (cfs_rq_throttled(cfs_rq))
>                         break;
>
> -               update_load_avg(se, 1);
> +               update_load_avg(se, 1, LOAD_ENQUEUE + (p->on_rq & TASK_ON_RQ_MIGRATING));
>                 update_cfs_shares(cfs_rq);
>         }
>
> @@ -4380,7 +4393,7 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>                 if (cfs_rq_throttled(cfs_rq))
>                         break;
>
> -               update_load_avg(se, 1);
> +               update_load_avg(se, 1, LOAD_DEQUEUE + (p->on_rq & TASK_ON_RQ_MIGRATING));
>                 update_cfs_shares(cfs_rq);
>         }
>

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH v10 1/3] cpufreq: Add mechanism for registering utilization update callbacks
  2016-02-22 10:52                     ` Peter Zijlstra
  2016-02-22 14:33                       ` Vincent Guittot
@ 2016-02-22 14:40                       ` Juri Lelli
  2016-02-22 15:42                         ` Peter Zijlstra
  2016-02-22 21:46                       ` Rafael J. Wysocki
  2 siblings, 1 reply; 134+ messages in thread
From: Juri Lelli @ 2016-02-22 14:40 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Steve Muckle, Srinivas Pandruvada, Rafael J. Wysocki,
	Linux PM list, Ingo Molnar, Linux Kernel Mailing List,
	Viresh Kumar, Thomas Gleixner, Rafael J. Wysocki

Hi Peter,

On 22/02/16 11:52, Peter Zijlstra wrote:
> On Fri, Feb 19, 2016 at 09:28:23AM -0800, Steve Muckle wrote:
> > On 02/19/2016 08:42 AM, Srinivas Pandruvada wrote:
> > > We did experiments using util/max in intel_pstate. For some benchmarks
> > > there were regressions of 4 to 5%; for some benchmarks it performed on
> > > par with getting utilization from the processor. Further optimization
> > > of the algorithm is possible and still in progress. The idea is that
> > > we can change the P-state fast enough and be more reactive. Once I
> > > have good data, I will send it to this list. The algorithm can be
> > > part of the cpufreq governor too.
> > 
> > There has been a lot of work in the area of scheduler-driven CPU
> > frequency selection by Linaro and ARM as well. It was posted most
> > recently a couple months ago:
> > 
> > http://thread.gmane.org/gmane.linux.power-management.general/69176
> > 
> > It was also posted as part of the energy-aware scheduling series last
> > July. There's a new RFC series forthcoming which I had hoped (and
> > failed) to post prior to my business travel this week; it should be out
> > next week. It will address the feedback received thus far along with
> > locking and other things.
> 
> Right, so I had a wee look at that again, and had a quick chat with Juri
> on IRC. So the main difference seems to be that you guys want to know
> why the utilization changed, as opposed to purely _that_ it changed.
> 
> And hence you have callbacks all over the place.
> 
> I'm not too sure I really like that too much; it bloats the code and
> somewhat obfuscates the point.
> 
> So I would really like there to be just the one callback when we
> actually compute a new number, and that is update_load_avg().
> 
> Now I think we can 'easily' propagate the information you want into
> update_load_avg() (see below), but I would like to see actual arguments
> for why you would need this.
> 

Right. The information we propagate with your patch might be all we
need, but I'll have to play with it on top of Rafael's or Steve's
changes to fully convince myself. :-)

> For one, the migration bits don't really make sense. We typically do not
> call migration code locally on both cpus, typically just one, but possibly
> neither. That means you cannot actually update the relevant CPU state
> from these sites anyway.
> 

I might actually have one point regarding migrations. See below. And I'm
not sure I understand why you are saying that we can't update the
relevant CPU state on migrations; we do know src and dst cpus, don't we?

[...]

> @@ -4320,7 +4333,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>  		if (cfs_rq_throttled(cfs_rq))
>  			break;
>  
> -		update_load_avg(se, 1);
> +		update_load_avg(se, 1, LOAD_ENQUEUE + (p->on_rq & TASK_ON_RQ_MIGRATING));
>  		update_cfs_shares(cfs_rq);
>  	}
>  
> @@ -4380,7 +4393,7 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>  		if (cfs_rq_throttled(cfs_rq))
>  			break;
>  
> -		update_load_avg(se, 1);
> +		update_load_avg(se, 1, LOAD_DEQUEUE + (p->on_rq & TASK_ON_RQ_MIGRATING));
>  		update_cfs_shares(cfs_rq);
>  	}
>  

What we are trying to do with the sched-freq approach (and maybe that
is just broken :-/) is to wait until all tasks are detached from the
src cpu and attached to the dst cpu before triggering updates on those
cpus. I fear that if we don't do that we might have problems with any
sort of rate limiting for freq transitions we might need to put in
place.

Best,

- Juri

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH v10 1/3] cpufreq: Add mechanism for registering utilization update callbacks
  2016-02-22 14:33                       ` Vincent Guittot
@ 2016-02-22 15:31                         ` Peter Zijlstra
  0 siblings, 0 replies; 134+ messages in thread
From: Peter Zijlstra @ 2016-02-22 15:31 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Steve Muckle, Srinivas Pandruvada, Juri Lelli, Rafael J. Wysocki,
	Linux PM list, Ingo Molnar, Linux Kernel Mailing List,
	Viresh Kumar, Thomas Gleixner, Rafael J. Wysocki

On Mon, Feb 22, 2016 at 03:33:02PM +0100, Vincent Guittot wrote:

> > Right, so I had a wee look at that again, and had a quick chat with Juri
> > on IRC. So the main difference seems to be that you guys want to know
> > why the utilization changed, as opposed to purely _that_ it changed.
> 
> Yes, the main goal was to be able to filter the useful and useless
> updates of the rq's utilization in order to minimize/optimize the
> triggering of frequency updates. These patches were made for a cpufreq
> driver that reacts far more slowly than the scheduler. It might be
> worth starting with a simple solution and updating it afterwards.

Right, always start simple :-)

> > And hence you have callbacks all over the place.
> >
> > I'm not too sure I really like that too much; it bloats the code and
> > somewhat obfuscates the point.
> >
> > So I would really like there to be just the one callback when we
> > actually compute a new number, and that is update_load_avg().
> >
> > Now I think we can 'easily' propagate the information you want into
> > update_load_avg() (see below), but I would like to see actual arguments
> > for why you would need this.
> 
> Your proposal is interesting, except that we are interested in the
> rq's utilization more than the se's, so we should rather use
> update_cfs_rq_load_avg and a few additional places like
> attach_entity_load_avg, which bypasses update_cfs_rq_load_avg to
> update the rq's utilization and load.

Ah, so the intent was to use the rq->cfs util, but I might have gotten
a little lost in the load update code (I always get confused by that
code if I haven't looked at it for a while).

We can put the hook in update_cfs_rq_load_avg(), that shouldn't be a
problem.

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2859,29 +2859,6 @@ static inline int update_cfs_rq_load_avg
 	cfs_rq->load_last_update_time_copy = sa->last_update_time;
 #endif
 
-	return decayed || removed;
-}
-
-/* Update task and its cfs_rq load average */
-static inline void update_load_avg(struct sched_entity *se, int update_tg,
-				   enum load_update_type type)
-{
-	struct cfs_rq *cfs_rq = cfs_rq_of(se);
-	u64 now = cfs_rq_clock_task(cfs_rq);
-	struct rq *rq = rq_of(cfs_rq);
-	int cpu = cpu_of(rq);
-
-	/*
-	 * Track task load average for carrying it to new CPU after migrated, and
-	 * track group sched_entity load average for task_h_load calc in migration
-	 */
-	__update_load_avg(now, cpu, &se->avg,
-			  se->on_rq * scale_load_down(se->load.weight),
-			  cfs_rq->curr == se, NULL);
-
-	if (update_cfs_rq_load_avg(now, cfs_rq) && update_tg)
-		update_tg_load_avg(cfs_rq, 0);
-
 	if (cpu == smp_processor_id() && &rq->cfs == cfs_rq) {
 		unsigned long max = rq->cpu_capacity_orig;
 
@@ -2904,6 +2881,29 @@ static inline void update_load_avg(struc
 		cpufreq_update_util(rq_clock(rq),
 				    min(cfs_rq->avg.util_avg, max), max);
 	}
+
+	return decayed || removed;
+}
+
+/* Update task and its cfs_rq load average */
+static inline void update_load_avg(struct sched_entity *se, int update_tg,
+				   enum load_update_type type)
+{
+	struct cfs_rq *cfs_rq = cfs_rq_of(se);
+	u64 now = cfs_rq_clock_task(cfs_rq);
+	struct rq *rq = rq_of(cfs_rq);
+	int cpu = cpu_of(rq);
+
+	/*
+	 * Track task load average for carrying it to new CPU after migrated, and
+	 * track group sched_entity load average for task_h_load calc in migration
+	 */
+	__update_load_avg(now, cpu, &se->avg,
+			  se->on_rq * scale_load_down(se->load.weight),
+			  cfs_rq->curr == se, NULL);
+
+	if (update_cfs_rq_load_avg(now, cfs_rq) && update_tg)
+		update_tg_load_avg(cfs_rq, 0);
 }
 
 static void attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
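
A caveat for anyone who wants to actually compile this sketch: the hook
block moved into update_cfs_rq_load_avg() above still uses the rq and
cpu locals of update_load_avg(), so update_cfs_rq_load_avg() would need
its own, something like (untested):

	struct rq *rq = rq_of(cfs_rq);
	int cpu = cpu_of(rq);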

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH v10 1/3] cpufreq: Add mechanism for registering utilization update callbacks
  2016-02-22 14:40                       ` Juri Lelli
@ 2016-02-22 15:42                         ` Peter Zijlstra
  0 siblings, 0 replies; 134+ messages in thread
From: Peter Zijlstra @ 2016-02-22 15:42 UTC (permalink / raw)
  To: Juri Lelli
  Cc: Steve Muckle, Srinivas Pandruvada, Rafael J. Wysocki,
	Linux PM list, Ingo Molnar, Linux Kernel Mailing List,
	Viresh Kumar, Thomas Gleixner, Rafael J. Wysocki

On Mon, Feb 22, 2016 at 02:40:01PM +0000, Juri Lelli wrote:

> > For one, the migration bits don't really make sense. We typically do not
> > call migration code locally on both cpus, typically just one, but possibly
> > neither. That means you cannot actually update the relevant CPU state
> > from these sites anyway.
> > 
> 
> I might actually have one point regarding migrations. See below. And I'm
> not sure I understand why you are saying that we can't update the
> relevant CPU state on migrations; we do know src and dst cpus, don't we?
> 
> [...]
> 
> > @@ -4320,7 +4333,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> >  		if (cfs_rq_throttled(cfs_rq))
> >  			break;
> >  
> > -		update_load_avg(se, 1);
> > +		update_load_avg(se, 1, LOAD_ENQUEUE + (p->on_rq & TASK_ON_RQ_MIGRATING));
> >  		update_cfs_shares(cfs_rq);
> >  	}
> >  
> > @@ -4380,7 +4393,7 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> >  		if (cfs_rq_throttled(cfs_rq))
> >  			break;
> >  
> > -		update_load_avg(se, 1);
> > +		update_load_avg(se, 1, LOAD_DEQUEUE + (p->on_rq & TASK_ON_RQ_MIGRATING));
> >  		update_cfs_shares(cfs_rq);
> >  	}
> >  

Well, yes, you have the src and dst cpu numbers, but if you want to
access that data remotely you'll have to go add atomic ops or locking.

And you'll have to go trigger IPIs to program remote state (or wait for
the next event on the CPU).

That all is expensive.
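
For illustration, poking a remote CPU from a migration site would boil
down to something like the line below (update_freq_fn being a
hypothetical callback), i.e. an IPI either way:

	/* run update_freq_fn(NULL) on @cpu, don't wait for completion */
	smp_call_function_single(cpu, update_freq_fn, NULL, 0);

and that is a cost paid on every such migration.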

> What we are trying to do with the sched-freq approach (and maybe that
> is just broken :-/) is to wait until all tasks are detached from the
> src cpu and attached to the dst cpu before triggering updates on those
> cpus. I fear that if we don't do that we might have problems with any
> sort of rate limiting for freq transitions we might need to put in
> place.

Hurm.. tricky that :-)

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH v10 1/3] cpufreq: Add mechanism for registering utilization update callbacks
  2016-02-22  9:32                   ` Juri Lelli
@ 2016-02-22 21:26                     ` Rafael J. Wysocki
  2016-02-23 11:01                       ` Juri Lelli
  0 siblings, 1 reply; 134+ messages in thread
From: Rafael J. Wysocki @ 2016-02-22 21:26 UTC (permalink / raw)
  To: Juri Lelli
  Cc: Rafael J. Wysocki, Rafael J. Wysocki, Linux PM list,
	Peter Zijlstra, Ingo Molnar, Linux Kernel Mailing List,
	Srinivas Pandruvada, Viresh Kumar, Steve Muckle, Thomas Gleixner

On Mon, Feb 22, 2016 at 10:32 AM, Juri Lelli <juri.lelli@arm.com> wrote:
> On 19/02/16 23:14, Rafael J. Wysocki wrote:
>> On Friday, February 19, 2016 08:09:17 AM Juri Lelli wrote:
>> > Hi Rafael,
>> >
>> > On 18/02/16 21:22, Rafael J. Wysocki wrote:
>> > > On Mon, Feb 15, 2016 at 10:47 PM, Rafael J. Wysocki <rjw@rjwysocki.net> wrote:
>> > > > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>> > > >

[cut]

>> That said, if the concern is that there are plans to change the way the
>> scheduler computes the utilization numbers and that may become difficult to
>> carry out if cpufreq starts to depend on them in their current form, then I
>> may agree that it is valid, but I'm not aware of those plans ATM.
>>
>
> No, I don't think there's any substantial discussion going on about the
> utilization numbers.

OK, so the statement below applies.

>> However, if the numbers are going to stay what they are, I don't see why
>> passing them to cpufreq may possibly become problematic at any point.
>
> My concern was mostly on the fact that there is already another RFC
> under discussion that uses the same numbers and has different hooks
> placed in scheduler code (Steve's sched-freq); so, additional hooks
> might generate confusion, IMHO.

So this is about the hooks rather than about their arguments after
all, isn't it?

I fail to see why it is better to drop the arguments and leave the hooks, then.

OTOH, I see reasons for keeping the arguments along with the hooks,
but let me address that in my next reply.

Now, if the call sites of the hooks change in the future, it won't be
a problem for me as long as the new hooks are invoked on a regular
basis or, if they aren't, as long as I can figure out from the
arguments they pass that I should not expect an update any time soon.

If the arguments change, it won't be a problem either as long as they
are sufficient to be inserted into the frequency selection formula
used by the schedutil governor I posted and produce sensible
frequencies for the CPU.

Thanks,
Rafael

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH v10 1/3] cpufreq: Add mechanism for registering utilization update callbacks
  2016-02-22  9:42                       ` Juri Lelli
@ 2016-02-22 21:41                         ` Rafael J. Wysocki
  2016-02-23 11:10                           ` Juri Lelli
  0 siblings, 1 reply; 134+ messages in thread
From: Rafael J. Wysocki @ 2016-02-22 21:41 UTC (permalink / raw)
  To: Juri Lelli
  Cc: Rafael J. Wysocki, Srinivas Pandruvada, Rafael J. Wysocki,
	Linux PM list, Peter Zijlstra, Ingo Molnar,
	Linux Kernel Mailing List, Viresh Kumar, Steve Muckle,
	Thomas Gleixner

On Mon, Feb 22, 2016 at 10:42 AM, Juri Lelli <juri.lelli@arm.com> wrote:
> Hi Rafael,
>
> On 19/02/16 23:26, Rafael J. Wysocki wrote:
>> On Friday, February 19, 2016 05:26:04 PM Juri Lelli wrote:
>> > Hi Srinivas,

[cut]

>> ---
>> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>> Subject: [PATCH] cpufreq: Rework the scheduler hooks for triggering updates
>>
>> Commit fe7034338ba0 (cpufreq: Add mechanism for registering
>> utilization update callbacks) added cpufreq_update_util() to be
>> called by the scheduler (from the CFS part) on utilization updates.
>> The goal was to allow CFS to pass utilization information to cpufreq
>> and to trigger it to evaluate the frequency/voltage configuration
>> (P-state) of every CPU on a regular basis.
>>
>> However, the last two arguments of that function are never used by
>> the current code, so CFS might simply call cpufreq_trigger_update()
>> instead of it.
>>
>> For this reason, drop the last two arguments of cpufreq_update_util(),
>> rename it to cpufreq_trigger_update() and modify CFS to call it.
>>
>> Moreover, since the utilization is not involved in that now, rename
>> data types, functions and variables related to cpufreq_trigger_update()
>> to reflect that (eg. struct update_util_data becomes struct
>> freq_update_hook and so on).
>>
>> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>
> This patch looks good to me. I haven't tested it yet, but it shouldn't
> break things AFAICT.
>
> Thanks a lot for taking the time for this cleanup.

Alas, I don't think I will apply it.

Peter says that he wants the arguments to stay and he has a point IMO.

The very idea behind hooking up cpufreq to the scheduler through those
hooks has always been to make it possible to use the utilization
information provided by the scheduler in cpufreq.  As it turns out, we
can make significant improvements even *without* using that
information, because just having the hooks in there alone makes it
possible to simplify the code quite a bit in general and make it more
straightforward, but that's a *bonus* and not the objective. :-)

The objective still is to use the utilization numbers from the scheduler.

Both sched-freq and my approach agree on that, so I don't quite see
why I should pretend that this isn't the case now?

Thanks,
Rafael

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH v10 1/3] cpufreq: Add mechanism for registering utilization update callbacks
  2016-02-22 10:52                     ` Peter Zijlstra
  2016-02-22 14:33                       ` Vincent Guittot
  2016-02-22 14:40                       ` Juri Lelli
@ 2016-02-22 21:46                       ` Rafael J. Wysocki
  2 siblings, 0 replies; 134+ messages in thread
From: Rafael J. Wysocki @ 2016-02-22 21:46 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Steve Muckle, Srinivas Pandruvada, Juri Lelli, Rafael J. Wysocki,
	Linux PM list, Ingo Molnar, Linux Kernel Mailing List,
	Viresh Kumar, Thomas Gleixner, Rafael J. Wysocki

On Mon, Feb 22, 2016 at 11:52 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Fri, Feb 19, 2016 at 09:28:23AM -0800, Steve Muckle wrote:
>> On 02/19/2016 08:42 AM, Srinivas Pandruvada wrote:
>> > We did experiments using util/max in intel_pstate. For some benchmarks
>> > there were regressions of 4 to 5%; for some benchmarks it performed on
>> > par with getting utilization from the processor. Further optimization
>> > of the algorithm is possible and still in progress. The idea is that
>> > we can change the P-state fast enough and be more reactive. Once I
>> > have good data, I will send it to this list. The algorithm can be
>> > part of the cpufreq governor too.
>>
>> There has been a lot of work in the area of scheduler-driven CPU
>> frequency selection by Linaro and ARM as well. It was posted most
>> recently a couple months ago:
>>
>> http://thread.gmane.org/gmane.linux.power-management.general/69176
>>
>> It was also posted as part of the energy-aware scheduling series last
>> July. There's a new RFC series forthcoming which I had hoped (and
>> failed) to post prior to my business travel this week; it should be out
>> next week. It will address the feedback received thus far along with
>> locking and other things.
>
> Right, so I had a wee look at that again, and had a quick chat with Juri
> on IRC. So the main difference seems to be that you guys want to know
> why the utilization changed, as opposed to purely _that_ it changed.
>
> And hence you have callbacks all over the place.
>
> I'm not too sure I really like that too much; it bloats the code and
> somewhat obfuscates the point.
>
> So I would really like there to be just the one callback when we
> actually compute a new number, and that is update_load_avg().
>
> Now I think we can 'easily' propagate the information you want into
> update_load_avg() (see below), but I would like to see actual arguments
> for why you would need this.
>
> For one, the migration bits don't really make sense. We typically do not
> call migration code locally on both cpus, typically just one, but possibly
> neither. That means you cannot actually update the relevant CPU state
> from these sites anyway.
>
>> The scheduler hooks for utilization-based cpufreq operation deserve a
>> lot more debate I think. They could quite possibly have different
>> requirements than hooks which are chosen just to guarantee periodic
>> callbacks into sampling-based governors.
>
> I'll repeat what Rafael said: the periodic callback nature is a
> 'temporary' hack, simply because current cpufreq depends on that.
>
> The idea is to wean cpufreq off of that requirement and then drop that
> part.

Right, and I can see at least a couple of ways to do that, but it'll
depend on where the final hooks will be located and what arguments
they will pass.

Thanks,
Rafael

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH v10 1/3] cpufreq: Add mechanism for registering utilization update callbacks
  2016-02-19 22:35                     ` Rafael J. Wysocki
@ 2016-02-23  3:58                       ` Steve Muckle
  0 siblings, 0 replies; 134+ messages in thread
From: Steve Muckle @ 2016-02-23  3:58 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Srinivas Pandruvada, Juri Lelli, Rafael J. Wysocki, Linux PM list,
	Peter Zijlstra, Ingo Molnar, Linux Kernel Mailing List,
	Viresh Kumar, Thomas Gleixner

On 02/19/2016 02:35 PM, Rafael J. Wysocki wrote:
>> The scheduler hooks for utilization-based cpufreq operation deserve a
>> lot more debate I think. They could quite possibly have different
>> requirements than hooks which are chosen just to guarantee periodic
>> callbacks into sampling-based governors.
> 
> Yes, they could.
> 
> The point here, though, is that even the sampling-based governors may 
> benefit from using the numbers provided by the scheduler instead of trying
> to come up with analogous numbers themselves.

It seems premature to me to merge supporting infrastructure (the
utilization hooks) before we have changes, be it modifications to the
sampling-based governors to use utilization or a scheduler-guided
governor, which are well tested and proven to yield reasonable
performance and power across various platforms and workloads.

Perhaps I'm a pessimist but I think it's going to be a challenge to get
utilization-based cpufreq on par, and I think getting the hooks right
will be part of that challenge.

>> For my part I think it would be best if the util/max parameters are
>> omitted
> 
> OK, so please see the patch I've just sent to Juri:
> 
> https://patchwork.kernel.org/patch/8364621/

Looked good to me.

thanks,
Steve

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH v10 1/3] cpufreq: Add mechanism for registering utilization update callbacks
  2016-02-22 21:26                     ` Rafael J. Wysocki
@ 2016-02-23 11:01                       ` Juri Lelli
  2016-02-24  2:01                         ` Rafael J. Wysocki
  0 siblings, 1 reply; 134+ messages in thread
From: Juri Lelli @ 2016-02-23 11:01 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Rafael J. Wysocki, Linux PM list, Peter Zijlstra, Ingo Molnar,
	Linux Kernel Mailing List, Srinivas Pandruvada, Viresh Kumar,
	Steve Muckle, Thomas Gleixner

On 22/02/16 22:26, Rafael J. Wysocki wrote:
> On Mon, Feb 22, 2016 at 10:32 AM, Juri Lelli <juri.lelli@arm.com> wrote:
> > On 19/02/16 23:14, Rafael J. Wysocki wrote:
> >> On Friday, February 19, 2016 08:09:17 AM Juri Lelli wrote:
> >> > Hi Rafael,
> >> >
> >> > On 18/02/16 21:22, Rafael J. Wysocki wrote:
> >> > > On Mon, Feb 15, 2016 at 10:47 PM, Rafael J. Wysocki <rjw@rjwysocki.net> wrote:
> >> > > > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> >> > > >
> 
> [cut]
> 
> >> That said, if the concern is that there are plans to change the way the
> >> scheduler computes the utilization numbers and that may become difficult to
> >> carry out if cpufreq starts to depend on them in their current form, then I
> >> may agree that it is valid, but I'm not aware of those plans ATM.
> >>
> >
> > No, I don't think there's any substantial discussion going on about the
> > utilization numbers.
> 
> OK, so the statement below applies.
> 
> >> However, if the numbers are going to stay what they are, I don't see why
> >> passing them to cpufreq may possibly become problematic at any point.
> >
> > My concern was mostly on the fact that there is already another RFC
> > under discussion that uses the same numbers and has different hooks
> > placed in scheduler code (Steve's sched-freq); so, additional hooks
> > might generate confusion, IMHO.
> 
> So this is about the hooks rather than about their arguments after
> all, isn't it?
> 
> I fail to see why it is better to drop the arguments and leave the hooks, then.
> 

It's about where we place such hooks and what arguments they have.
Without the schedutil governor as a consumer the current position makes
sense, but some of the arguments are not used. With schedutil both
position and arguments make sense, but a different implementation
(sched-freq) might have different needs w.r.t. position and arguments.

> OTOH, I see reasons for keeping the arguments along with the hooks,
> but let me address that in my next reply.
> 
> Now, if the call sites of the hooks change in the future, it won't be
> a problem for me as long as the new hooks are invoked on a regular
> basis or, if they aren't, as long as I can figure out from the
> arguments they pass that I should not expect an update any time soon.
> 

OK.

> If the arguments change, it won't be a problem either as long as they
> are sufficient to be inserted into the frequency selection formula
> used by the schedutil governor I posted and produce sensible
> frequencies for the CPU.
> 

Right, I guess this applies to any kind of governor.

Best,

- Juri

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH v10 1/3] cpufreq: Add mechanism for registering utilization update callbacks
  2016-02-22 21:41                         ` Rafael J. Wysocki
@ 2016-02-23 11:10                           ` Juri Lelli
  2016-02-24  1:52                             ` Rafael J. Wysocki
  0 siblings, 1 reply; 134+ messages in thread
From: Juri Lelli @ 2016-02-23 11:10 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Rafael J. Wysocki, Srinivas Pandruvada, Linux PM list,
	Peter Zijlstra, Ingo Molnar, Linux Kernel Mailing List,
	Viresh Kumar, Steve Muckle, Thomas Gleixner

On 22/02/16 22:41, Rafael J. Wysocki wrote:
> On Mon, Feb 22, 2016 at 10:42 AM, Juri Lelli <juri.lelli@arm.com> wrote:
> > Hi Rafael,
> >
> > On 19/02/16 23:26, Rafael J. Wysocki wrote:
> >> On Friday, February 19, 2016 05:26:04 PM Juri Lelli wrote:
> >> > Hi Srinivas,
> 
> [cut]
> 
> >> ---
> >> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> >> Subject: [PATCH] cpufreq: Rework the scheduler hooks for triggering updates
> >>
> >> Commit fe7034338ba0 (cpufreq: Add mechanism for registering
> >> utilization update callbacks) added cpufreq_update_util() to be
> >> called by the scheduler (from the CFS part) on utilization updates.
> >> The goal was to allow CFS to pass utilization information to cpufreq
> >> and to trigger it to evaluate the frequency/voltage configuration
> >> (P-state) of every CPU on a regular basis.
> >>
> >> However, the last two arguments of that function are never used by
> >> the current code, so CFS might simply call cpufreq_trigger_update()
> >> instead of it.
> >>
> >> For this reason, drop the last two arguments of cpufreq_update_util(),
> >> rename it to cpufreq_trigger_update() and modify CFS to call it.
> >>
> >> Moreover, since the utilization is not involved in that now, rename
> >> data types, functions and variables related to cpufreq_trigger_update()
> >> to reflect that (eg. struct update_util_data becomes struct
> >> freq_update_hook and so on).
> >>
> >> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> >
> > This patch looks good to me. I haven't tested it yet, but it shouldn't
> > break things AFAICT.
> >
> > Thanks a lot for taking the time for this cleanup.
> 
> Alas, I don't think I will apply it.
> 
> Peter says that he wants the arguments to stay and he has a point IMO.
> 
> The very idea behind hooking up cpufreq to the scheduler through those
> hooks has always been to make it possible to use the utilization
> information provided by the scheduler in cpufreq.  As it turns out, we
> can make significant improvements even *without* using that
> information, because just having the hooks in there alone makes it
> possible to simplify the code quite a bit in general and make it more
> straightforward, but that's a *bonus* and not the objective. :-)
> 
> The objective still is to use the utilization numbers from the scheduler.
> 
> Both sched-freq and my approach agree on that, so I don't quite see
> why I should pretend that this isn't the case now?
> 

As I said in the other reply, I'm not at all against having cpufreq
hooks in the scheduler. I was only wondering if deciding where such
hooks reside and which interface they have before we agreed on how they
will be used might cause problems in the future. :-)

Best,

- Juri

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH v10 1/3] cpufreq: Add mechanism for registering utilization update callbacks
  2016-02-23 11:10                           ` Juri Lelli
@ 2016-02-24  1:52                             ` Rafael J. Wysocki
  0 siblings, 0 replies; 134+ messages in thread
From: Rafael J. Wysocki @ 2016-02-24  1:52 UTC (permalink / raw)
  To: Juri Lelli
  Cc: Rafael J. Wysocki, Srinivas Pandruvada, Linux PM list,
	Peter Zijlstra, Ingo Molnar, Linux Kernel Mailing List,
	Viresh Kumar, Steve Muckle, Thomas Gleixner

On Tuesday, February 23, 2016 11:10:07 AM Juri Lelli wrote:
> On 22/02/16 22:41, Rafael J. Wysocki wrote:
> > On Mon, Feb 22, 2016 at 10:42 AM, Juri Lelli <juri.lelli@arm.com> wrote:
> > > Hi Rafael,
> > >
> > > On 19/02/16 23:26, Rafael J. Wysocki wrote:
> > >> On Friday, February 19, 2016 05:26:04 PM Juri Lelli wrote:
> > >> > Hi Srinivas,
> > 
> > [cut]
> > 
> > >> ---
> > >> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> > >> Subject: [PATCH] cpufreq: Rework the scheduler hooks for triggering updates
> > >>
> > >> Commit fe7034338ba0 (cpufreq: Add mechanism for registering
> > >> utilization update callbacks) added cpufreq_update_util() to be
> > >> called by the scheduler (from the CFS part) on utilization updates.
> > >> The goal was to allow CFS to pass utilization information to cpufreq
> > >> and to trigger it to evaluate the frequency/voltage configuration
> > >> (P-state) of every CPU on a regular basis.
> > >>
> > >> However, the last two arguments of that function are never used by
> > >> the current code, so CFS might simply call cpufreq_trigger_update()
> > >> instead of it.
> > >>
> > >> For this reason, drop the last two arguments of cpufreq_update_util(),
> > >> rename it to cpufreq_trigger_update() and modify CFS to call it.
> > >>
> > >> Moreover, since the utilization is not involved in that now, rename
> > >> data types, functions and variables related to cpufreq_trigger_update()
> > >> to reflect that (eg. struct update_util_data becomes struct
> > >> freq_update_hook and so on).
> > >>
> > >> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> > >
> > > This patch looks good to me. I haven't tested it yet, but it shouldn't
> > > break things AFAICT.
> > >
> > > Thanks a lot for taking the time for this cleanup.
> > 
> > Alas, I don't think I will apply it.
> > 
> > Peter says that he wants the arguments to stay and he has a point IMO.
> > 
> > The very idea behind hooking up cpufreq to the scheduler through those
> > hooks has always been to make it possible to use the utilization
> > information provided by the scheduler in cpufreq.  As it turns out, we
> > can make significant improvements even *without* using that
> > information, because just having the hooks in there alone makes it
> > possible to simplify the code quite a bit in general and make it more
> > straightforward, but that's a *bonus* and not the objective. :-)
> > 
> > The objective still is to use the utilization numbers from the scheduler.
> > 
> > Both sched-freq and my approach agree on that, so I don't quite see
> > why I should pretend that this isn't the case now?
> > 
> 
> As I said in the other reply, I'm not at all against having cpufreq
> hooks in the scheduler. I was only wondering if deciding where such
> hooks reside and which interface they have before we agreed on how they
> will be used might cause problems in the future. :-)

And I have said a few times that I don't quite see what exactly those
problems might be.

Also, having the hooks in there with the util and max arguments allows
everybody to play with them and see what can be done and whether or not
they are suitable for particular purposes.  I've already sufficiently
demonstrated that they are generally useful, I think.

And if they aren't suitable for a particular purpose, one can try different
types of changes and see what looks good, what's practical and what's not,
etc.  intel_pstate can do that, you can do that, I can look at it from the
existing cpufreq governors' perspective, and so on.  In other words, it
facilitates future development.

Thanks,
Rafael

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH v10 1/3] cpufreq: Add mechanism for registering utilization update callbacks
  2016-02-23 11:01                       ` Juri Lelli
@ 2016-02-24  2:01                         ` Rafael J. Wysocki
  2016-03-08 19:24                           ` Michael Turquette
  0 siblings, 1 reply; 134+ messages in thread
From: Rafael J. Wysocki @ 2016-02-24  2:01 UTC (permalink / raw)
  To: Juri Lelli
  Cc: Rafael J. Wysocki, Linux PM list, Peter Zijlstra, Ingo Molnar,
	Linux Kernel Mailing List, Srinivas Pandruvada, Viresh Kumar,
	Steve Muckle, Thomas Gleixner

On Tuesday, February 23, 2016 11:01:18 AM Juri Lelli wrote:
> On 22/02/16 22:26, Rafael J. Wysocki wrote:
> > On Mon, Feb 22, 2016 at 10:32 AM, Juri Lelli <juri.lelli@arm.com> wrote:
> > > On 19/02/16 23:14, Rafael J. Wysocki wrote:
> > >> On Friday, February 19, 2016 08:09:17 AM Juri Lelli wrote:
> > >> > Hi Rafael,
> > >> >
> > >> > On 18/02/16 21:22, Rafael J. Wysocki wrote:
> > >> > > On Mon, Feb 15, 2016 at 10:47 PM, Rafael J. Wysocki <rjw@rjwysocki.net> wrote:
> > >> > > > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> > >> > > >
> > 
> > [cut]
> > 
> > >> That said, if the concern is that there are plans to change the way the
> > >> scheduler computes the utilization numbers and that may become difficult to
> > >> carry out if cpufreq starts to depend on them in their current form, then I
> > >> may agree that it is valid, but I'm not aware of those plans ATM.
> > >>
> > >
> > > No, I don't think there's any substantial discussion going on about the
> > > utilization numbers.
> > 
> > OK, so the statement below applies.
> > 
> > >> However, if the numbers are going to stay what they are, I don't see why
> > >> passing them to cpufreq may possibly become problematic at any point.
> > >
> > > My concern was mostly on the fact that there is already another RFC
> > > under discussion that uses the same numbers and has different hooks
> > > placed in scheduler code (Steve's sched-freq); so, additional hooks
> > > might generate confusion, IMHO.
> > 
> > So this is about the hooks rather than about their arguments after
> > all, isn't it?
> > 
> > I fail to see why it is better to drop the arguments and leave the hooks, then.
> > 
> 
> It's about where we place such hooks and what arguments they have.
> Without the schedutil governor as a consumer the current position makes
> sense, but some of the arguments are not used. With schedutil both
> position and arguments make sense, but a different implementation
> (sched-freq) might have different needs w.r.t. position and arguments.

And that's fine.  If the current position and/or arguments are not suitable,
they'll need to be changed.  It's not like things introduced today are set
in stone forever.

Peter has already shown how they may be changed to make everyone happy,
so I don't really see what the fuss is about.

> > OTOH, I see reasons for keeping the arguments along with the hooks,
> > but let me address that in my next reply.
> > 
> > Now, if the call sites of the hooks change in the future, it won't be
> > a problem for me as long as the new hooks are invoked on a regular
> > basis or, if they aren't, as long as I can figure out from the
> > arguments they pass that I should not expect an update any time soon.
> > 
> 
> OK.
> 
> > If the arguments change, it won't be a problem either as long as they
> > are sufficient to be inserted into the frequency selection formula
> > used by the schedutil governor I posted and produce sensible
> > frequencies for the CPU.
> > 
> 
> Right, I guess this applies to any kind of governor.

Sure, but this particular formula is very simple.  It just assumes that
util <= max so dividing the former by the latter will always yield a number
between 0 and 1.  [And the interpretation of util > max is totally arbitrary
today and regarded as temporary anyway, so that's just irrelevant.]
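
For illustration, the formula boils down to something like the sketch
below (a hypothetical helper, not the actual schedutil source):

static unsigned int pick_next_freq(unsigned int max_freq,
				   unsigned long util, unsigned long max)
{
	if (util > max)		/* arbitrary today, as noted above */
		util = max;

	/* util / max is in [0, 1], so the result never exceeds max_freq */
	return max_freq * util / max;
}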

Thanks,
Rafael

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH 0/3] cpufreq: Replace timers with utilization update callbacks
  2016-02-12 14:48                       ` Vincent Guittot
@ 2016-03-01 13:58                         ` Peter Zijlstra
  2016-03-01 14:17                           ` Juri Lelli
  2016-03-01 14:58                           ` Vincent Guittot
  0 siblings, 2 replies; 134+ messages in thread
From: Peter Zijlstra @ 2016-03-01 13:58 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Juri Lelli, Steve Muckle, Rafael J. Wysocki, Rafael J. Wysocki,
	Linux PM list, Linux Kernel Mailing List, Srinivas Pandruvada,
	Viresh Kumar, Thomas Gleixner

On Fri, Feb 12, 2016 at 03:48:54PM +0100, Vincent Guittot wrote:

> Another point to take into account is that the RT tasks will "steal"
> the compute capacity that has been requested by the cfs tasks.
> 
> Let's take the example of a CPU with 3 OPPs on which run 2 rt tasks A
> and B and 1 cfs task C.

> Let's assume that the real-time constraint of RT task A is too
> aggressive for the lowest OPP0 and that the change of the frequency of
> the core is too slow compared to this constraint, but the real-time
> constraint of RT task B can be handled at any OPP. The system has no
> choice other than setting the cpufreq min freq to OPP1 to be sure that
> the constraint of task A will be covered at any time.

> Then, we still have 2 possible OPPs. The CFS task asks for compute
> capacity that fits in OPP1, but a part of this capacity will be stolen
> by the RT tasks. If we monitor the load of the RT tasks and request
> capacity for them according to their current utilization, we can
> decide to switch to the highest OPP2 to ensure that task C will have
> enough remaining capacity. A lot of embedded platforms face this kind
> of use case.

Still doesn't make sense. How would you know the constraint of RT task
A, and that it cannot be satisfied by OPP0? The only information you
have in the task model is a static priority.

The only possible choice the kernel has at this point is max OPP. It
doesn't have enough (_any_) information about worst case execution of
that task.

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH 0/3] cpufreq: Replace timers with utilization update callbacks
  2016-03-01 13:58                         ` Peter Zijlstra
@ 2016-03-01 14:17                           ` Juri Lelli
  2016-03-01 14:24                             ` Peter Zijlstra
  2016-03-01 14:58                           ` Vincent Guittot
  1 sibling, 1 reply; 134+ messages in thread
From: Juri Lelli @ 2016-03-01 14:17 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Vincent Guittot, Steve Muckle, Rafael J. Wysocki,
	Rafael J. Wysocki, Linux PM list, Linux Kernel Mailing List,
	Srinivas Pandruvada, Viresh Kumar, Thomas Gleixner

On 01/03/16 14:58, Peter Zijlstra wrote:
> On Fri, Feb 12, 2016 at 03:48:54PM +0100, Vincent Guittot wrote:
> 
> > Another point to take into account is that the RT tasks will "steal"
> > the compute capacity that has been requested by the cfs tasks.
> > 
> > Let's take the example of a CPU with 3 OPPs on which run 2 rt tasks A
> > and B and 1 cfs task C.
>
> > Let's assume that the real-time constraint of RT task A is too
> > aggressive for the lowest OPP0 and that the change of the frequency of
> > the core is too slow compared to this constraint, but the real-time
> > constraint of RT task B can be handled at any OPP. The system has no
> > choice other than setting the cpufreq min freq to OPP1 to be sure that
> > the constraint of task A will be covered at any time.
>
> > Then, we still have 2 possible OPPs. The CFS task asks for compute
> > capacity that fits in OPP1, but a part of this capacity will be stolen
> > by the RT tasks. If we monitor the load of the RT tasks and request
> > capacity for them according to their current utilization, we can
> > decide to switch to the highest OPP2 to ensure that task C will have
> > enough remaining capacity. A lot of embedded platforms face this kind
> > of use case.
> 
> Still doesn't make sense. How would you know the constraint of RT task
> A, and that it cannot be satisfied by OPP0? The only information you
> have in the task model is a static priority.
> 

But, can't we have the problem Vincent describes if we s/RT/DL/ ?

Thanks,

- Juri

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH 0/3] cpufreq: Replace timers with utilization update callbacks
  2016-03-01 14:17                           ` Juri Lelli
@ 2016-03-01 14:24                             ` Peter Zijlstra
  2016-03-01 14:26                               ` Peter Zijlstra
  0 siblings, 1 reply; 134+ messages in thread
From: Peter Zijlstra @ 2016-03-01 14:24 UTC (permalink / raw)
  To: Juri Lelli
  Cc: Vincent Guittot, Steve Muckle, Rafael J. Wysocki,
	Rafael J. Wysocki, Linux PM list, Linux Kernel Mailing List,
	Srinivas Pandruvada, Viresh Kumar, Thomas Gleixner

On Tue, Mar 01, 2016 at 02:17:06PM +0000, Juri Lelli wrote:
> On 01/03/16 14:58, Peter Zijlstra wrote:
> > On Fri, Feb 12, 2016 at 03:48:54PM +0100, Vincent Guittot wrote:
> > 
> > > Another point to take into account is that the RT tasks will "steal"
> > > the compute capacity that has been requested by the cfs tasks.
> > > 
> > > Let's take the example of a CPU with 3 OPPs on which 2 RT tasks A
> > > and B and 1 CFS task C run.
> > 
> > > Let's assume that the real-time constraint of RT task A is too
> > > aggressive for the lowest OPP0 and that changing the frequency of
> > > the core is too slow compared to this constraint, but that the
> > > real-time constraint of RT task B can be handled at any OPP. The
> > > system then has no choice other than setting the cpufreq min freq
> > > to OPP1 to be sure that the constraint of task A is covered at any
> > > time.
> > 
> > > Then we still have 2 possible OPPs. The CFS task asks for compute
> > > capacity that fits in OPP1, but a part of this capacity will be
> > > stolen by the RT tasks. If we monitor the load of the RT tasks and
> > > request capacity for them according to their current utilization,
> > > we can decide to switch to the highest OPP2 to ensure that task C
> > > will have enough remaining capacity. A lot of embedded platforms
> > > face this kind of use case.
> > 
> > Still doesn't make sense. How would you know the constraint of RT task
> > A, and that it cannot be satisfied by OPP0 ? The only information you
> > have in the task model is a static priority.
> > 
> 
> But, can't we have the problem Vincent describes if we s/RT/DL/ ?

Still not sure I actually see a problem. With DL you have a minimal OPP
required to guarantee correct execution of the DL tasks. For CFS you
have an average util reflecting its workload.

Add the two and you've got an effective OPP request. Or in CPPC terms:
we request a min freq of the DL and a max freq of DL+avg_CFS.

We could probably improve upon that by also tracking an avg DL and
lowering the max freq request to: min(DL, avg_DL + avg_CFS). The
consequence is that when the DL tasks hit peaks (over their avg) the CFS
tasks get a little more delay. But this might be a worthwhile trade-off.

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH 0/3] cpufreq: Replace timers with utilization update callbacks
  2016-03-01 14:24                             ` Peter Zijlstra
@ 2016-03-01 14:26                               ` Peter Zijlstra
  2016-03-01 14:42                                 ` Juri Lelli
  0 siblings, 1 reply; 134+ messages in thread
From: Peter Zijlstra @ 2016-03-01 14:26 UTC (permalink / raw)
  To: Juri Lelli
  Cc: Vincent Guittot, Steve Muckle, Rafael J. Wysocki,
	Rafael J. Wysocki, Linux PM list, Linux Kernel Mailing List,
	Srinivas Pandruvada, Viresh Kumar, Thomas Gleixner

On Tue, Mar 01, 2016 at 03:24:59PM +0100, Peter Zijlstra wrote:
> On Tue, Mar 01, 2016 at 02:17:06PM +0000, Juri Lelli wrote:
> > On 01/03/16 14:58, Peter Zijlstra wrote:
> > > On Fri, Feb 12, 2016 at 03:48:54PM +0100, Vincent Guittot wrote:
> > > 
> > > > Another point to take into account is that the RT tasks will "steal"
> > > > the compute capacity that has been requested by the cfs tasks.
> > > > 
> > > > Let's take the example of a CPU with 3 OPPs on which 2 RT tasks
> > > > A and B and 1 CFS task C run.
> > > 
> > > > Let's assume that the real-time constraint of RT task A is too
> > > > aggressive for the lowest OPP0 and that changing the frequency of
> > > > the core is too slow compared to this constraint, but that the
> > > > real-time constraint of RT task B can be handled at any OPP. The
> > > > system then has no choice other than setting the cpufreq min freq
> > > > to OPP1 to be sure that the constraint of task A is covered at
> > > > any time.
> > > 
> > > > Then we still have 2 possible OPPs. The CFS task asks for compute
> > > > capacity that fits in OPP1, but a part of this capacity will be
> > > > stolen by the RT tasks. If we monitor the load of the RT tasks
> > > > and request capacity for them according to their current
> > > > utilization, we can decide to switch to the highest OPP2 to
> > > > ensure that task C will have enough remaining capacity. A lot of
> > > > embedded platforms face this kind of use case.
> > > 
> > > Still doesn't make sense. How would you know the constraint of RT task
> > > A, and that it cannot be satisfied by OPP0 ? The only information you
> > > have in the task model is a static priority.
> > > 
> > 
> > But, can't we have the problem Vincent describes if we s/RT/DL/ ?
> 
> Still not sure I actually see a problem. With DL you have a minimal OPP
> required to guarantee correct execution of the DL tasks. For CFS you
> have an average util reflecting its workload.
> 
> Add the two and you've got an effective OPP request. Or in CPPC terms:
> we request a min freq of the DL and a max freq of DL+avg_CFS.
> 
> We could probably improve upon that by also tracking an avg DL and
> lowering the max freq request to: min(DL, avg_DL + avg_CFS). The

max(DL, avg_DL + avg_CFS) obviously! ;-)

> consequence is that when the DL tasks hit peaks (over their avg) the CFS
> tasks get a little more delay. But this might be a worthwhile trade-off.
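
In code, the corrected request could look like this minimal sketch (all
names are hypothetical illustrations: dl_bw is the bandwidth reserved
for DL tasks, avg_dl and avg_cfs are tracked average utilizations, all
on the same scale):

#include <linux/kernel.h>

/* Sketch of the CPPC-style hint described above: the floor guarantees
 * DL correctness, the ceiling covers average DL plus average CFS. */
struct freq_hint {
	unsigned long min;	/* floor: reserved DL bandwidth */
	unsigned long max;	/* ceiling: max(DL, avg_DL + avg_CFS) */
};

static struct freq_hint make_hint(unsigned long dl_bw,
				  unsigned long avg_dl,
				  unsigned long avg_cfs)
{
	struct freq_hint hint = {
		.min = dl_bw,
		.max = max(dl_bw, avg_dl + avg_cfs),
	};

	return hint;
}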

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH 0/3] cpufreq: Replace timers with utilization update callbacks
  2016-03-01 14:26                               ` Peter Zijlstra
@ 2016-03-01 14:42                                 ` Juri Lelli
  2016-03-01 15:04                                   ` Peter Zijlstra
  0 siblings, 1 reply; 134+ messages in thread
From: Juri Lelli @ 2016-03-01 14:42 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Vincent Guittot, Steve Muckle, Rafael J. Wysocki,
	Rafael J. Wysocki, Linux PM list, Linux Kernel Mailing List,
	Srinivas Pandruvada, Viresh Kumar, Thomas Gleixner

On 01/03/16 15:26, Peter Zijlstra wrote:
> On Tue, Mar 01, 2016 at 03:24:59PM +0100, Peter Zijlstra wrote:
> > On Tue, Mar 01, 2016 at 02:17:06PM +0000, Juri Lelli wrote:
> > > On 01/03/16 14:58, Peter Zijlstra wrote:
> > > > On Fri, Feb 12, 2016 at 03:48:54PM +0100, Vincent Guittot wrote:
> > > > 
> > > > > Another point to take into account is that the RT tasks will "steal"
> > > > > the compute capacity that has been requested by the cfs tasks.
> > > > > 
> > > > > Let's take the example of a CPU with 3 OPPs on which 2 RT
> > > > > tasks A and B and 1 CFS task C run.
> > > > 
> > > > > Let's assume that the real-time constraint of RT task A is too
> > > > > aggressive for the lowest OPP0 and that changing the frequency
> > > > > of the core is too slow compared to this constraint, but that
> > > > > the real-time constraint of RT task B can be handled at any
> > > > > OPP. The system then has no choice other than setting the
> > > > > cpufreq min freq to OPP1 to be sure that the constraint of
> > > > > task A is covered at any time.
> > > > 
> > > > > Then we still have 2 possible OPPs. The CFS task asks for
> > > > > compute capacity that fits in OPP1, but a part of this capacity
> > > > > will be stolen by the RT tasks. If we monitor the load of the
> > > > > RT tasks and request capacity for them according to their
> > > > > current utilization, we can decide to switch to the highest
> > > > > OPP2 to ensure that task C will have enough remaining capacity.
> > > > > A lot of embedded platforms face this kind of use case.
> > > > 
> > > > Still doesn't make sense. How would you know the constraint of RT task
> > > > A, and that it cannot be satisfied by OPP0 ? The only information you
> > > > have in the task model is a static priority.
> > > > 
> > > 
> > > But, can't we have the problem Vincent describes if we s/RT/DL/ ?
> > 
> > Still not sure I actually see a problem. With DL you have a minimal OPP
> > required to guarantee correct execution of the DL tasks. For CFS you
> > have an average util reflecting its workload.
> > 
> > Add the two and you've got an effective OPP request. Or in CPPC terms:
> > we request a min freq of the DL and a max freq of DL+avg_CFS.
> > 
> > We could probably improve upon that by also tracking an avg DL and
> > lowering the max freq request to: min(DL, avg_DL + avg_CFS). The
> 
> max(DL, avg_DL + avg_CFS) obviously! ;-)
> 
> > consequence is that when the DL tasks hit peaks (over their avg) the CFS
> > tasks get a little more delay. But this might be a worthwhile trade-off.
> 

Agree. My point was actually more about Rafael's schedutil RFC (I should
probably have posted this there, but I thought it fitted well with this
example). I realize that Rafael is starting simple, but I fear that some
aggregation of the util coming from the different classes will be needed
in the end; schedfreq already has something along these lines.

IMHO, the general approach would be that every scheduling class has an
interface to communicate its util requirement. Then RT will probably
have to ask for max, but CFS and DL will do better.
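
As a minimal sketch of that idea (all names below are hypothetical,
only meant to illustrate the aggregation, not a proposed API):

#include <linux/kernel.h>

/* Per-class utilization requests folded into one frequency hint;
 * RT asks for max, DL and CFS report tracked numbers. */
enum util_class { UTIL_CFS, UTIL_RT, UTIL_DL, UTIL_NR };

struct cpu_util_req {
	unsigned long util[UTIL_NR];	/* per-class requests, 0..max */
	unsigned long max;		/* capacity scale */
};

static unsigned long aggregate_util(const struct cpu_util_req *req)
{
	if (req->util[UTIL_RT])
		return req->max;	/* RT: no better info, go to max */

	/* DL floor plus average CFS, clamped to the capacity scale. */
	return min(req->max, req->util[UTIL_DL] + req->util[UTIL_CFS]);
}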

Thanks,

- Juri

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH 0/3] cpufreq: Replace timers with utilization update callbacks
  2016-03-01 13:58                         ` Peter Zijlstra
  2016-03-01 14:17                           ` Juri Lelli
@ 2016-03-01 14:58                           ` Vincent Guittot
  1 sibling, 0 replies; 134+ messages in thread
From: Vincent Guittot @ 2016-03-01 14:58 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Juri Lelli, Steve Muckle, Rafael J. Wysocki, Rafael J. Wysocki,
	Linux PM list, Linux Kernel Mailing List, Srinivas Pandruvada,
	Viresh Kumar, Thomas Gleixner

On 1 March 2016 at 14:58, Peter Zijlstra <peterz@infradead.org> wrote:
> On Fri, Feb 12, 2016 at 03:48:54PM +0100, Vincent Guittot wrote:
>
>> Another point to take into account is that the RT tasks will "steal"
>> the compute capacity that has been requested by the cfs tasks.
>>
>> Let's take the example of a CPU with 3 OPPs on which 2 RT tasks A
>> and B and 1 CFS task C run.
>
>> Let's assume that the real-time constraint of RT task A is too
>> aggressive for the lowest OPP0 and that changing the frequency of the
>> core is too slow compared to this constraint, but that the real-time
>> constraint of RT task B can be handled at any OPP. The system then
>> has no choice other than setting the cpufreq min freq to OPP1 to be
>> sure that the constraint of task A is covered at any time.
>
>> Then we still have 2 possible OPPs. The CFS task asks for compute
>> capacity that fits in OPP1, but a part of this capacity will be
>> stolen by the RT tasks. If we monitor the load of the RT tasks and
>> request capacity for them according to their current utilization, we
>> can decide to switch to the highest OPP2 to ensure that task C will
>> have enough remaining capacity. A lot of embedded platforms face
>> this kind of use case.
>
> Still doesn't make sense. How would you know the constraint of RT task
> A, and that it cannot be satisfied by OPP0 ? The only information you
> have in the task model is a static priority.

The kernel doesn't have this information, so that's why the sysfs
cpufreq/scaling_min_freq has to be used to prevent the kernel (and
cpufreq in particular) from using OPP0.
From a kernel/sched/cpufreq pov, we assume that all OPPs above
cpufreq/scaling_min can be used with the RT tasks of the system. And
the performance governor is used if only the highest OPP can be used.
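
For illustration, the clamping this relies on amounts to something like
the sketch below (not the exact cpufreq code):

#include <linux/cpufreq.h>

/* Sketch: a governor request is clamped by the policy limits, so once
 * user space writes OPP1's frequency into scaling_min_freq, OPP0 can
 * never be selected no matter what the governor asks for. */
static unsigned int resolve_freq(struct cpufreq_policy *policy,
				 unsigned int target_freq)
{
	return clamp_val(target_freq, policy->min, policy->max);
}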

>
> The only possible choice the kernel has at this point is max OPP. It
> doesn't have enough (_any_) information about worst case execution of
> that task.
>

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH 0/3] cpufreq: Replace timers with utilization update callbacks
  2016-03-01 14:42                                 ` Juri Lelli
@ 2016-03-01 15:04                                   ` Peter Zijlstra
  2016-03-01 19:49                                     ` Rafael J. Wysocki
  0 siblings, 1 reply; 134+ messages in thread
From: Peter Zijlstra @ 2016-03-01 15:04 UTC (permalink / raw)
  To: Juri Lelli
  Cc: Vincent Guittot, Steve Muckle, Rafael J. Wysocki,
	Rafael J. Wysocki, Linux PM list, Linux Kernel Mailing List,
	Srinivas Pandruvada, Viresh Kumar, Thomas Gleixner

On Tue, Mar 01, 2016 at 02:42:10PM +0000, Juri Lelli wrote:
> Agree. My point was actually more about Rafael's schedutil RFC (I should
> probably have posted this there, but I thought it fitted well with this
> example). I realize that Rafael is starting simple, but I fear that some
> aggregation of the util coming from the different classes will be needed
> in the end; schedfreq already has something along these lines.

Right, but I'm not sure that's a hard thing to add. But yes, it needs
doing.

It also very much has a bearing on the OPP state selection. As already
pointed out, the nearest OPP thing Rafael did is just wrong for DL.

It probably makes sense to pass a CPPC like form into the (software) OPP
selector.

> IMHO, the general approach would be that every scheduling class has an
> interface to communicate its util requirement. Then RT will probably
> have to ask for max, but CFS and DL will do better.

Right, so on IRC you mentioned that we could also use the global (or
cgroup) RT throttle to lower the RT util/OPP.
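
A rough sketch of that idea, assuming a hypothetical helper on top of
the global throttle knobs (sysctl_sched_rt_runtime/_period, both in
microseconds):

#include <linux/math64.h>
#include <linux/sched/sysctl.h>

/* Sketch: bound the RT class's request by the global RT throttle
 * instead of assuming the worst; with the default 950000/1000000
 * throttle, RT can consume at most 95% of the capacity. */
static unsigned long rt_util_cap(unsigned long max)
{
	if (sysctl_sched_rt_runtime < 0)
		return max;	/* no throttling: assume the worst */

	return div_u64((u64)max * sysctl_sched_rt_runtime,
		       sysctl_sched_rt_period);
}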

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH 0/3] cpufreq: Replace timers with utilization update callbacks
  2016-03-01 15:04                                   ` Peter Zijlstra
@ 2016-03-01 19:49                                     ` Rafael J. Wysocki
  0 siblings, 0 replies; 134+ messages in thread
From: Rafael J. Wysocki @ 2016-03-01 19:49 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Juri Lelli, Vincent Guittot, Steve Muckle, Rafael J. Wysocki,
	Rafael J. Wysocki, Linux PM list, Linux Kernel Mailing List,
	Srinivas Pandruvada, Viresh Kumar, Thomas Gleixner

On Tue, Mar 1, 2016 at 4:04 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Tue, Mar 01, 2016 at 02:42:10PM +0000, Juri Lelli wrote:
>> Agree. My point was actually more about Rafael's schedutil RFC (I should
>> probably have posted this there, but I thought it fitted well with this
>> example). I realize that Rafael is starting simple, but I fear that some
>> aggregation of the util coming from the different classes will be needed
>> in the end; schedfreq already has something along these lines.
>
> Right, but I'm not sure that's a hard thing to add. But yes, it needs
> doing.
>
> It also very much has a bearing on the OPP state selection. As already
> pointed out, the nearest OPP thing Rafael did is just wrong for DL.
>
> It probably makes sense to pass a CPPC like form into the (software) OPP
> selector.
>
>> IMHO, the general approach would be that every scheduling class has an
>> interface to communicate its util requirement. Then RT will probably
>> have to ask for max, but CFS and DL will do better.
>
> Right, so on IRC you mentioned that we could also use the global (or
> cgroup) RT throttle to lower the RT util/OPP.

The current code simply treats RT/DL as "unknown" and will always ask
for the max for them.  That should work, although it's suboptimal for
DL at least.  However, I'd prefer to add anything more sophisticated
on top of it later, just to keep things simple to start with.
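
In code, that policy amounts to something like the sketch below (the
flag name is hypothetical and any headroom margin is left out):

/* Sketch: RT/DL utilization is treated as unknown, so the governor
 * asks for the maximum; for CFS, util <= max, so the result scales
 * linearly between 0 and max_freq. */
static unsigned int pick_freq(unsigned long util, unsigned long max,
			      unsigned int max_freq, bool rt_dl_active)
{
	if (rt_dl_active)
		return max_freq;

	return max_freq * util / max;
}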

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH v10 1/3] cpufreq: Add mechanism for registering utilization update callbacks
  2016-02-24  2:01                         ` Rafael J. Wysocki
@ 2016-03-08 19:24                           ` Michael Turquette
  2016-03-08 20:40                             ` Rafael J. Wysocki
  0 siblings, 1 reply; 134+ messages in thread
From: Michael Turquette @ 2016-03-08 19:24 UTC (permalink / raw)
  To: Rafael J. Wysocki, Juri Lelli
  Cc: Rafael J. Wysocki, Linux PM list, Peter Zijlstra, Ingo Molnar,
	Linux Kernel Mailing List, Srinivas Pandruvada, Viresh Kumar,
	Steve Muckle, Thomas Gleixner

Quoting Rafael J. Wysocki (2016-02-23 18:01:06)
> On Tuesday, February 23, 2016 11:01:18 AM Juri Lelli wrote:
> > On 22/02/16 22:26, Rafael J. Wysocki wrote:
> > > On Mon, Feb 22, 2016 at 10:32 AM, Juri Lelli <juri.lelli@arm.com> wrote:
> > > > On 19/02/16 23:14, Rafael J. Wysocki wrote:
> > > >> On Friday, February 19, 2016 08:09:17 AM Juri Lelli wrote:
> > > >> > Hi Rafael,
> > > >> >
> > > >> > On 18/02/16 21:22, Rafael J. Wysocki wrote:
> > > >> > > On Mon, Feb 15, 2016 at 10:47 PM, Rafael J. Wysocki <rjw@rjwysocki.net> wrote:
> > > >> > > > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> > > >> > > >
> > > 
> > > [cut]
> > > 
> > > >> That said, if the concern is that there are plans to change the way the
> > > >> scheduler computes the utilization numbers and that may become difficult to
> > > >> carry out if cpufreq starts to depend on them in their current form, then I
> > > >> may agree that it is valid, but I'm not aware of those plans ATM.
> > > >>
> > > >
> > > > No, I don't think there's any substantial discussion going on about the
> > > > utilization numbers.
> > > 
> > > OK, so the statement below applies.
> > > 
> > > >> However, if the numbers are going to stay what they are, I don't see why
> > > >> passing them to cpufreq may possibly become problematic at any point.
> > > >
> > > > My concern was mostly on the fact that there is already another RFC
> > > > under discussion that uses the same numbers and has different hooks
> > > > placed in scheduler code (Steve's sched-freq); so, additional hooks
> > > > might generate confusion, IMHO.
> > > 
> > > So this is about the hooks rather than about their arguments after
> > > all, isn't it?
> > > 
> > > I fail to see why it is better to drop the arguments and leave the hooks, then.
> > > 
> > 
> > It's about where we place such hooks and what arguments they have.
> > Without the schedutil governor as a consumer the current position makes
> > sense, but some of the arguments are not used. With schedutil both
> > position and arguments make sense, but a different implementation
> > (sched-freq) might have different needs w.r.t. position and arguments.
> 
> And that's fine.  If the current position and/or arguments are not suitable,
> they'll need to be changed.  It's not like things introduced today are set
> in stone forever.
> 
> Peter has already shown how they may be changed to make everyone happy,
> so I don't really see what the fuss is about.

I see this patch in linux-next now. Did it ever get Peter's or Ingo's
Ack?

Also it seems weird to me that this patch touching sched code is going
through the pm tree.

When it comes time to experiment more with the interfaces and make the
"future changes" that everyone keeps talking about, who is the
maintainer? Who has the last word?

Regards,
Mike

> 
> > > OTOH, I see reasons for keeping the arguments along with the hooks,
> > > but let me address that in my next reply.
> > > 
> > > Now, if the call sites of the hooks change in the future, it won't be
> > > a problem for me as long as the new hooks are invoked on a regular
> > > basis or, if they aren't, as long as I can figure out from the
> > > arguments they pass that I should not expect an update any time soon.
> > > 
> > 
> > OK.
> > 
> > > If the arguments change, it won't be a problem either as long as they
> > > are sufficient to be inserted into the frequency selection formula
> > > used by the schedutil governor I posted and produce sensible
> > > frequencies for the CPU.
> > > 
> > 
> > Right, I guess this applies to any kind of governor.
> 
> Sure, but this particular formula is very simple.  It just assumes that
> util <= max so dividing the former by the latter will always yield a number
> between 0 and 1.  [And the interpretation of util > max is totally arbitrary
> today and regarded as temporary anyway, so that's just irrelevant.]
> 
> Thanks,
> Rafael
> 

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH v10 1/3] cpufreq: Add mechanism for registering utilization update callbacks
  2016-03-08 19:24                           ` Michael Turquette
@ 2016-03-08 20:40                             ` Rafael J. Wysocki
       [not found]                               ` <20160308220632.4103.13377@quark.deferred.io>
  0 siblings, 1 reply; 134+ messages in thread
From: Rafael J. Wysocki @ 2016-03-08 20:40 UTC (permalink / raw)
  To: Michael Turquette
  Cc: Rafael J. Wysocki, Juri Lelli, Rafael J. Wysocki, Linux PM list,
	Peter Zijlstra, Ingo Molnar, Linux Kernel Mailing List,
	Srinivas Pandruvada, Viresh Kumar, Steve Muckle, Thomas Gleixner

On Tue, Mar 8, 2016 at 8:24 PM, Michael Turquette
<mturquette@baylibre.com> wrote:
> Quoting Rafael J. Wysocki (2016-02-23 18:01:06)
>> On Tuesday, February 23, 2016 11:01:18 AM Juri Lelli wrote:
>> > On 22/02/16 22:26, Rafael J. Wysocki wrote:
>> > > On Mon, Feb 22, 2016 at 10:32 AM, Juri Lelli <juri.lelli@arm.com> wrote:
>> > > > On 19/02/16 23:14, Rafael J. Wysocki wrote:
>> > > >> On Friday, February 19, 2016 08:09:17 AM Juri Lelli wrote:
>> > > >> > Hi Rafael,
>> > > >> >
>> > > >> > On 18/02/16 21:22, Rafael J. Wysocki wrote:
>> > > >> > > On Mon, Feb 15, 2016 at 10:47 PM, Rafael J. Wysocki <rjw@rjwysocki.net> wrote:
>> > > >> > > > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>> > > >> > > >
>> > >
>> > > [cut]
>> > >
>> > > >> That said, if the concern is that there are plans to change the way the
>> > > >> scheduler computes the utilization numbers and that may become difficult to
>> > > >> carry out if cpufreq starts to depend on them in their current form, then I
>> > > >> may agree that it is valid, but I'm not aware of those plans ATM.
>> > > >>
>> > > >
>> > > > No, I don't think there's any substantial discussion going on about the
>> > > > utilization numbers.
>> > >
>> > > OK, so the statement below applies.
>> > >
>> > > >> However, if the numbers are going to stay what they are, I don't see why
>> > > >> passing them to cpufreq may possibly become problematic at any point.
>> > > >
>> > > > My concern was mostly on the fact that there is already another RFC
>> > > > under discussion that uses the same numbers and has different hooks
>> > > > placed in scheduler code (Steve's sched-freq); so, additional hooks
>> > > > might generate confusion, IMHO.
>> > >
>> > > So this is about the hooks rather than about their arguments after
>> > > all, isn't it?
>> > >
>> > > I fail to see why it is better to drop the arguments and leave the hooks, then.
>> > >
>> >
>> > It's about where we place such hooks and what arguments they have.
>> > Without the schedutil governor as a consumer the current position makes
>> > sense, but some of the arguments are not used. With schedutil both
>> > position and arguments make sense, but a different implementation
>> > (sched-freq) might have different needs w.r.t. position and arguments.
>>
>> And that's fine.  If the current position and/or arguments are not suitable,
>> they'll need to be changed.  It's not like things introduced today are set
>> in stone forever.
>>
>> Peter has already shown how they may be changed to make everyone happy,
>> so I don't really see what the fuss is about.
>
> I see this patch in linux-next now. Did it ever get Peter's or Ingo's
> Ack?

No, but none of them said "no" either.

And the interface was suggested by Peter in the first place.

> Also it seems weird to me that this patch touching sched code is going
> through the pm tree.

That's for purely practical reasons.  There are lots of PM changes
depending on it that have nothing to do with the scheduler.  I've been
talking about that several times now, most recently in my post from
yesterday (http://marc.info/?l=linux-pm&m=145740561402948&w=2).  I've
been talking openly about what I'm going to do with this all the time.

No one is hiding things from anyone or trying to slip them through
past somebody here if that's what you're worried about.

> When it comes time to experiment more with the interfaces and make the
> "future changes" that everyone keeps talking about, who is the
> maintainer? Who has the last word?

As usual, it is about consensus.

This is on a boundary of two subsystems and I have good reasons to do
it.  One of the maintainers of the other subsystem involved is working
with me all the time and I'm following his suggestions.  Isn't that
really sufficient?

But really please see
http://marc.info/?l=linux-pm&m=145740561402948&w=2 as it means I'm
actually going to do what Juri and Steve asked for unless I'm told
that this is a bad idea.

Thanks,
Rafael

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH v10 1/3] cpufreq: Add mechanism for registering utilization update callbacks
       [not found]                               ` <20160308220632.4103.13377@quark.deferred.io>
@ 2016-03-08 22:43                                 ` Rafael J. Wysocki
  0 siblings, 0 replies; 134+ messages in thread
From: Rafael J. Wysocki @ 2016-03-08 22:43 UTC (permalink / raw)
  To: Michael Turquette
  Cc: Rafael J. Wysocki, Rafael J. Wysocki, Juri Lelli, Linux PM list,
	Peter Zijlstra, Ingo Molnar, Linux Kernel Mailing List,
	Srinivas Pandruvada, Viresh Kumar, Steve Muckle, Thomas Gleixner

On Tue, Mar 8, 2016 at 11:06 PM, Michael Turquette
<mturquette@baylibre.com> wrote:
> Quoting Rafael J. Wysocki (2016-03-08 12:40:18)
>> On Tue, Mar 8, 2016 at 8:24 PM, Michael Turquette
>> <mturquette@baylibre.com> wrote:
>> > Quoting Rafael J. Wysocki (2016-02-23 18:01:06)
>> >> On Tuesday, February 23, 2016 11:01:18 AM Juri Lelli wrote:
>> >> > On 22/02/16 22:26, Rafael J. Wysocki wrote:
>> >> > > On Mon, Feb 22, 2016 at 10:32 AM, Juri Lelli <juri.lelli@arm.com> wrote:
>> >> > > > On 19/02/16 23:14, Rafael J. Wysocki wrote:
>> >> > > >> On Friday, February 19, 2016 08:09:17 AM Juri Lelli wrote:
>> >> > > >> > Hi Rafael,
>> >> > > >> >
>> >> > > >> > On 18/02/16 21:22, Rafael J. Wysocki wrote:
>> >> > > >> > > On Mon, Feb 15, 2016 at 10:47 PM, Rafael J. Wysocki <rjw@rjwysocki.net> wrote:
>> >> > > >> > > > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>> >> > > >> > > >
>> >> > >
>> >> > > [cut]
>> >> > >
>> >> > > >> That said, if the concern is that there are plans to change the way the
>> >> > > >> scheduler computes the utilization numbers and that may become difficult to
>> >> > > >> carry out if cpufreq starts to depend on them in their current form, then I
>> >> > > >> may agree that it is valid, but I'm not aware of those plans ATM.
>> >> > > >>
>> >> > > >
>> >> > > > No, I don't think there's any substantial discussion going on about the
>> >> > > > utilization numbers.
>> >> > >
>> >> > > OK, so the statement below applies.
>> >> > >
>> >> > > >> However, if the numbers are going to stay what they are, I don't see why
>> >> > > >> passing them to cpufreq may possibly become problematic at any point.
>> >> > > >
>> >> > > > My concern was mostly on the fact that there is already another RFC
>> >> > > > under discussion that uses the same numbers and has different hooks
>> >> > > > placed in scheduler code (Steve's sched-freq); so, additional hooks
>> >> > > > might generate confusion, IMHO.
>> >> > >
>> >> > > So this is about the hooks rather than about their arguments after
>> >> > > all, isn't it?
>> >> > >
>> >> > > I fail to see why it is better to drop the arguments and leave the hooks, then.
>> >> > >
>> >> >
>> >> > It's about where we place such hooks and what arguments they have.
>> >> > Without the schedutil governor as a consumer the current position makes
>> >> > sense, but some of the arguments are not used. With schedutil both
>> >> > position and arguments make sense, but a different implementation
>> >> > (sched-freq) might have different needs w.r.t. position and arguments.
>> >>
>> >> And that's fine.  If the current position and/or arguments are not suitable,
>> >> they'll need to be changed.  It's not like things introduced today are set
>> >> in stone forever.
>> >>
>> >> Peter has already shown how they may be changed to make everyone happy,
>> >> so I don't really see what the fuss is about.
>> >
>> > I see this patch in linux-next now. Did it ever get Peter's or Ingo's
>> > Ack?
>>
>> No, but none of them said "no" either.
>>
>> And the interface was suggested by Peter in the first place.
>>
>> > Also it seems weird to me that this patch touching sched code is going
>> > through the pm tree.
>>
>> That's for purely practical reasons.  There are lots of PM changes
>> depending on it that have nothing to do with the scheduler.  I've been
>> talking about that several times now, most recently in my post from
>> yesterday (http://marc.info/?l=linux-pm&m=145740561402948&w=2).  I've
>> been talking openly about what I'm going to do with this all the time.
>>
>> No one is hiding things from anyone or trying to slip them through
>> past somebody here if that's what you're worried about.
>>
>> > When it comes time to experiment more with the interfaces and make the
>> > "future changes" that everyone keeps talking about, who is the
>> > maintainer? Who has the last word?
>>
>> As usual, it is about consensus.
>
> To be fair, that consensus should be recorded formally by Reviewed-by
> and Acked-by tags.

I would feel much more comfortable with ACKs on the commits touching
the scheduler code, no question about that. :-)

That said, if another maintainer makes PM-related or ACPI-related
changes and follows my suggestions all the way through, I may not feel
like I have to ACK all of that every time.  After all, it all boils
down to what happens to the pull request eventually and Acked-by tags
may or may not help there.

>>
>> This is on a boundary of two subsystems and I have good reasons to do
>> it.  One of the maintainers of the other subsystem involved is working
>> with me all the time and I'm following his suggestions.  Isn't that
>> really sufficient?
>>
>> But really please see
>> http://marc.info/?l=linux-pm&m=145740561402948&w=2 as it means I'm
>> actually going to do what Juri and Steve asked for unless I'm told
>> that this is a bad idea.
>
> I'll take a look. Note that Steve, Juri and Vincent are all at a
> conference this week so their responses may be slow.

That's fine.

I'm not going to send new versions of the patches any time soon
(unless somebody points out a problem to fix in them to me).

Thanks,
Rafael

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH v10 1/3] cpufreq: Add mechanism for registering utilization update callbacks
  2016-02-15 21:47           ` [PATCH v10 " Rafael J. Wysocki
  2016-02-18 20:22             ` Rafael J. Wysocki
@ 2016-03-09 12:35             ` Peter Zijlstra
  2016-03-09 13:22               ` Rafael J. Wysocki
                                 ` (2 more replies)
  1 sibling, 3 replies; 134+ messages in thread
From: Peter Zijlstra @ 2016-03-09 12:35 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Linux PM list, Ingo Molnar, Linux Kernel Mailing List,
	Srinivas Pandruvada, Viresh Kumar, Juri Lelli, Steve Muckle,
	Thomas Gleixner

On Mon, Feb 15, 2016 at 10:47:22PM +0100, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> 
> Introduce a mechanism by which parts of the cpufreq subsystem
> ("setpolicy" drivers or the core) can register callbacks to be
> executed from cpufreq_update_util() which is invoked by the
> scheduler's update_load_avg() on CPU utilization changes.
> 
> This allows the "setpolicy" drivers to dispense with their timers
> and do all of the computations they need and frequency/voltage
> adjustments in the update_load_avg() code path, among other things.
> 
> The update_load_avg() changes were suggested by Peter Zijlstra.
> 
> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
> ---
>  drivers/cpufreq/cpufreq.c |   45 +++++++++++++++++++++++++++++++++++++++++++++
>  include/linux/cpufreq.h   |   34 ++++++++++++++++++++++++++++++++++
>  kernel/sched/deadline.c   |    4 ++++
>  kernel/sched/fair.c       |   26 +++++++++++++++++++++++++-
>  kernel/sched/rt.c         |    4 ++++
>  kernel/sched/sched.h      |    1 +
>  6 files changed, 113 insertions(+), 1 deletion(-)
> 

So with the understanding that we'll work on getting rid of
cpufreq_trigger_update().

Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>

Also, Vincent had some concerns about the exact placement of the
callback, and I see no problem in moving it if there's need.
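
For reference, a driver-side sketch of the mechanism described in the
changelog above (using the update_util_data interface from the posted
patch; the my_* names are hypothetical and the details approximate):

#include <linux/cpufreq.h>

/* The callback runs in scheduler context: it must not sleep and may
 * only take raw spinlocks. */
static void my_update_util(struct update_util_data *data, u64 time,
			   unsigned long util, unsigned long max)
{
	/* Sample utilization and adjust the P-state here. */
}

static struct update_util_data my_udata = { .func = my_update_util };

static void my_start(unsigned int cpu)
{
	cpufreq_set_update_util_data(cpu, &my_udata);
}

static void my_stop(unsigned int cpu)
{
	cpufreq_set_update_util_data(cpu, NULL);
	synchronize_sched();	/* wait for in-flight callbacks */
}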

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH v10 1/3] cpufreq: Add mechanism for registering utilization update callbacks
  2016-03-09 12:35             ` Peter Zijlstra
@ 2016-03-09 13:22               ` Rafael J. Wysocki
  2016-03-09 13:32               ` Ingo Molnar
  2016-03-10  2:12               ` Vincent Guittot
  2 siblings, 0 replies; 134+ messages in thread
From: Rafael J. Wysocki @ 2016-03-09 13:22 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Rafael J. Wysocki, Linux PM list, Ingo Molnar,
	Linux Kernel Mailing List, Srinivas Pandruvada, Viresh Kumar,
	Juri Lelli, Steve Muckle, Thomas Gleixner

On Wed, Mar 9, 2016 at 1:35 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Mon, Feb 15, 2016 at 10:47:22PM +0100, Rafael J. Wysocki wrote:
>> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>>
>> Introduce a mechanism by which parts of the cpufreq subsystem
>> ("setpolicy" drivers or the core) can register callbacks to be
>> executed from cpufreq_update_util() which is invoked by the
>> scheduler's update_load_avg() on CPU utilization changes.
>>
>> This allows the "setpolicy" drivers to dispense with their timers
>> and do all of the computations they need and frequency/voltage
>> adjustments in the update_load_avg() code path, among other things.
>>
>> The update_load_avg() changes were suggested by Peter Zijlstra.
>>
>> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>> Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
>> ---
>>  drivers/cpufreq/cpufreq.c |   45 +++++++++++++++++++++++++++++++++++++++++++++
>>  include/linux/cpufreq.h   |   34 ++++++++++++++++++++++++++++++++++
>>  kernel/sched/deadline.c   |    4 ++++
>>  kernel/sched/fair.c       |   26 +++++++++++++++++++++++++-
>>  kernel/sched/rt.c         |    4 ++++
>>  kernel/sched/sched.h      |    1 +
>>  6 files changed, 113 insertions(+), 1 deletion(-)
>>
>
> So with the understanding that we'll work on getting rid of
> cpufreq_trigger_update().

That definitely is the plan.

> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>

Thanks! :-)

> Also, Vincent had some concerns about the exact placement of the
> callback, and I see no problem in moving it if there's need.

Yup, same here.

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH v10 1/3] cpufreq: Add mechanism for registering utilization update callbacks
  2016-03-09 12:35             ` Peter Zijlstra
  2016-03-09 13:22               ` Rafael J. Wysocki
@ 2016-03-09 13:32               ` Ingo Molnar
  2016-03-09 13:39                 ` Rafael J. Wysocki
  2016-03-10  2:12               ` Vincent Guittot
  2 siblings, 1 reply; 134+ messages in thread
From: Ingo Molnar @ 2016-03-09 13:32 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Rafael J. Wysocki, Linux PM list, Linux Kernel Mailing List,
	Srinivas Pandruvada, Viresh Kumar, Juri Lelli, Steve Muckle,
	Thomas Gleixner


* Peter Zijlstra <peterz@infradead.org> wrote:

> On Mon, Feb 15, 2016 at 10:47:22PM +0100, Rafael J. Wysocki wrote:
> > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> > 
> > Introduce a mechanism by which parts of the cpufreq subsystem
> > ("setpolicy" drivers or the core) can register callbacks to be
> > executed from cpufreq_update_util() which is invoked by the
> > scheduler's update_load_avg() on CPU utilization changes.
> > 
> > This allows the "setpolicy" drivers to dispense with their timers
> > and do all of the computations they need and frequency/voltage
> > adjustments in the update_load_avg() code path, among other things.
> > 
> > The update_load_avg() changes were suggested by Peter Zijlstra.
> > 
> > Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> > Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
> > ---
> >  drivers/cpufreq/cpufreq.c |   45 +++++++++++++++++++++++++++++++++++++++++++++
> >  include/linux/cpufreq.h   |   34 ++++++++++++++++++++++++++++++++++
> >  kernel/sched/deadline.c   |    4 ++++
> >  kernel/sched/fair.c       |   26 +++++++++++++++++++++++++-
> >  kernel/sched/rt.c         |    4 ++++
> >  kernel/sched/sched.h      |    1 +
> >  6 files changed, 113 insertions(+), 1 deletion(-)
> > 
> 
> So with the understanding that we'll work on getting rid of
> cpufreq_trigger_update().
> 
> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>

I'm happy with the latest iteration and with the general direction as well!

Acked-by: Ingo Molnar <mingo@kernel.org>

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH v10 1/3] cpufreq: Add mechanism for registering utilization update callbacks
  2016-03-09 13:32               ` Ingo Molnar
@ 2016-03-09 13:39                 ` Rafael J. Wysocki
  0 siblings, 0 replies; 134+ messages in thread
From: Rafael J. Wysocki @ 2016-03-09 13:39 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Peter Zijlstra, Rafael J. Wysocki, Linux PM list,
	Linux Kernel Mailing List, Srinivas Pandruvada, Viresh Kumar,
	Juri Lelli, Steve Muckle, Thomas Gleixner

On Wed, Mar 9, 2016 at 2:32 PM, Ingo Molnar <mingo@kernel.org> wrote:
>
> * Peter Zijlstra <peterz@infradead.org> wrote:
>
>> On Mon, Feb 15, 2016 at 10:47:22PM +0100, Rafael J. Wysocki wrote:
>> > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>> >
>> > Introduce a mechanism by which parts of the cpufreq subsystem
>> > ("setpolicy" drivers or the core) can register callbacks to be
>> > executed from cpufreq_update_util() which is invoked by the
>> > scheduler's update_load_avg() on CPU utilization changes.
>> >
>> > This allows the "setpolicy" drivers to dispense with their timers
>> > and do all of the computations they need and frequency/voltage
>> > adjustments in the update_load_avg() code path, among other things.
>> >
>> > The update_load_avg() changes were suggested by Peter Zijlstra.
>> >
>> > Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>> > Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
>> > ---
>> >  drivers/cpufreq/cpufreq.c |   45 +++++++++++++++++++++++++++++++++++++++++++++
>> >  include/linux/cpufreq.h   |   34 ++++++++++++++++++++++++++++++++++
>> >  kernel/sched/deadline.c   |    4 ++++
>> >  kernel/sched/fair.c       |   26 +++++++++++++++++++++++++-
>> >  kernel/sched/rt.c         |    4 ++++
>> >  kernel/sched/sched.h      |    1 +
>> >  6 files changed, 113 insertions(+), 1 deletion(-)
>> >
>>
>> So with the understanding that we'll work on getting rid of
>> cpufreq_trigger_update().
>>
>> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
>
> I'm happy with the latest iteration and with the general direction as well!
>
> Acked-by: Ingo Molnar <mingo@kernel.org>

Thanks a lot!

Rafael

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH v10 1/3] cpufreq: Add mechanism for registering utilization update callbacks
  2016-03-09 12:35             ` Peter Zijlstra
  2016-03-09 13:22               ` Rafael J. Wysocki
  2016-03-09 13:32               ` Ingo Molnar
@ 2016-03-10  2:12               ` Vincent Guittot
  2 siblings, 0 replies; 134+ messages in thread
From: Vincent Guittot @ 2016-03-10  2:12 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Rafael J. Wysocki, Linux PM list, Ingo Molnar,
	Linux Kernel Mailing List, Srinivas Pandruvada, Viresh Kumar,
	Juri Lelli, Steve Muckle, Thomas Gleixner

On 9 March 2016 at 19:35, Peter Zijlstra <peterz@infradead.org> wrote:
> On Mon, Feb 15, 2016 at 10:47:22PM +0100, Rafael J. Wysocki wrote:
>> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>>
>> Introduce a mechanism by which parts of the cpufreq subsystem
>> ("setpolicy" drivers or the core) can register callbacks to be
>> executed from cpufreq_update_util() which is invoked by the
>> scheduler's update_load_avg() on CPU utilization changes.
>>
>> This allows the "setpolicy" drivers to dispense with their timers
>> and do all of the computations they need and frequency/voltage
>> adjustments in the update_load_avg() code path, among other things.
>>
>> The update_load_avg() changes were suggested by Peter Zijlstra.
>>
>> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>> Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
>> ---
>>  drivers/cpufreq/cpufreq.c |   45 +++++++++++++++++++++++++++++++++++++++++++++
>>  include/linux/cpufreq.h   |   34 ++++++++++++++++++++++++++++++++++
>>  kernel/sched/deadline.c   |    4 ++++
>>  kernel/sched/fair.c       |   26 +++++++++++++++++++++++++-
>>  kernel/sched/rt.c         |    4 ++++
>>  kernel/sched/sched.h      |    1 +
>>  6 files changed, 113 insertions(+), 1 deletion(-)
>>
>
> So with the understanding that we'll work on getting rid of
> cpufreq_trigger_update().
>
> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
>
> Also, Vincent had some concerns about the exact placement of the
> callback, and I see no problem in moving it if there's need.

Yes, as explained previously, we can probably use another placement so
as not to miss any immediate change of the rq's utilization due to task
migration, but this optimization can probably be done as a next step.


^ permalink raw reply	[flat|nested] 134+ messages in thread

end of thread, other threads:[~2016-03-10  2:12 UTC | newest]

Thread overview: 134+ messages
2016-01-29 22:52 [PATCH 0/3] cpufreq: Replace timers with utilization update callbacks Rafael J. Wysocki
2016-01-29 22:53 ` [PATCH 1/3] cpufreq: Add a mechanism for registering " Rafael J. Wysocki
2016-02-04  3:31   ` Viresh Kumar
2016-01-29 22:56 ` [PATCH 2/3] cpufreq: intel_pstate: Replace timers with " Rafael J. Wysocki
2016-01-29 22:59 ` [PATCH 3/3] cpufreq: governor: " Rafael J. Wysocki
2016-02-03  1:16   ` [Update][PATCH " Rafael J. Wysocki
2016-02-04  4:49     ` Viresh Kumar
2016-02-04 10:54       ` Rafael J. Wysocki
2016-02-05  1:28     ` [PATCH 3/3 v3] " Rafael J. Wysocki
2016-02-05  6:50       ` Viresh Kumar
2016-02-05 13:36         ` Rafael J. Wysocki
2016-02-05 14:47           ` Viresh Kumar
2016-02-05 23:10             ` Rafael J. Wysocki
2016-02-07  9:10               ` Viresh Kumar
2016-02-07 14:43                 ` Rafael J. Wysocki
2016-02-08  2:08                   ` Rafael J. Wysocki
2016-02-08 11:52                     ` Viresh Kumar
2016-02-08 12:52                       ` Rafael J. Wysocki
2016-02-08 13:40                         ` Rafael J. Wysocki
2016-02-05 23:01           ` Rafael J. Wysocki
2016-02-06  3:40       ` [PATCH 3/3 v4] " Rafael J. Wysocki
2016-02-07  9:20         ` Viresh Kumar
2016-02-07 14:36           ` Rafael J. Wysocki
2016-02-07 14:50         ` [PATCH 3/3 v5] " Rafael J. Wysocki
2016-02-07 15:36           ` Viresh Kumar
2016-02-09 10:01           ` Gautham R Shenoy
2016-02-09 18:49             ` Rafael J. Wysocki
2016-02-03 22:20 ` [PATCH 0/3] cpufreq: " Rafael J. Wysocki
2016-02-04  0:08   ` Srinivas Pandruvada
2016-02-04 17:16     ` Rafael J. Wysocki
2016-02-04 10:51   ` Juri Lelli
2016-02-04 17:19     ` Rafael J. Wysocki
2016-02-08 23:06   ` Rafael J. Wysocki
2016-02-09  0:39     ` Steve Muckle
2016-02-09  1:01       ` Rafael J. Wysocki
2016-02-09 20:05         ` Rafael J. Wysocki
2016-02-10  1:02           ` Steve Muckle
2016-02-10  1:57             ` Rafael J. Wysocki
2016-02-10  3:09               ` Rafael J. Wysocki
2016-02-10 19:47                 ` Steve Muckle
2016-02-10 21:49                   ` Rafael J. Wysocki
2016-02-10 22:07                     ` Steve Muckle
2016-02-10 22:12                       ` Rafael J. Wysocki
2016-02-11 11:59             ` Peter Zijlstra
2016-02-11 12:24               ` Juri Lelli
2016-02-11 15:26                 ` Peter Zijlstra
2016-02-11 18:23                   ` Vincent Guittot
2016-02-12 14:04                     ` Peter Zijlstra
2016-02-12 14:48                       ` Vincent Guittot
2016-03-01 13:58                         ` Peter Zijlstra
2016-03-01 14:17                           ` Juri Lelli
2016-03-01 14:24                             ` Peter Zijlstra
2016-03-01 14:26                               ` Peter Zijlstra
2016-03-01 14:42                                 ` Juri Lelli
2016-03-01 15:04                                   ` Peter Zijlstra
2016-03-01 19:49                                     ` Rafael J. Wysocki
2016-03-01 14:58                           ` Vincent Guittot
2016-02-11 17:06               ` Steve Muckle
2016-02-11 17:30                 ` Peter Zijlstra
2016-02-11 17:34                   ` Rafael J. Wysocki
2016-02-11 17:38                     ` Peter Zijlstra
2016-02-11 18:52                   ` Steve Muckle
2016-02-11 19:04                     ` Rafael J. Wysocki
2016-02-12 13:43                       ` Rafael J. Wysocki
2016-02-12 14:10                     ` Peter Zijlstra
2016-02-12 16:01                       ` Rafael J. Wysocki
2016-02-12 16:15                         ` Rafael J. Wysocki
2016-02-12 16:53                           ` Ashwin Chaugule
2016-02-12 23:14                             ` Rafael J. Wysocki
2016-02-12 17:02                         ` Doug Smythies
2016-02-12 23:17                           ` Rafael J. Wysocki
2016-02-10 12:33           ` Juri Lelli
2016-02-10 13:23             ` Rafael J. Wysocki
2016-02-10 14:03               ` Juri Lelli
2016-02-10 14:26                 ` Rafael J. Wysocki
2016-02-10 14:46                   ` Juri Lelli
2016-02-10 15:46                     ` Rafael J. Wysocki
2016-02-10 16:05                       ` Juri Lelli
2016-02-11 11:51           ` Peter Zijlstra
2016-02-11 12:08             ` Rafael J. Wysocki
2016-02-11 15:29               ` Peter Zijlstra
2016-02-11 15:58                 ` Rafael J. Wysocki
2016-02-11 20:47               ` Rafael J. Wysocki
2016-02-10 15:17 ` [PATCH v6 " Rafael J. Wysocki
2016-02-10 15:21   ` [PATCH v6 1/3] cpufreq: Add mechanism for registering " Rafael J. Wysocki
2016-02-10 23:01     ` [PATCH v7 " Rafael J. Wysocki
2016-02-11 17:30       ` [PATCH v8 " Rafael J. Wysocki
2016-02-12 13:16         ` [PATCH v9 " Rafael J. Wysocki
2016-02-15 21:47           ` [PATCH v10 " Rafael J. Wysocki
2016-02-18 20:22             ` Rafael J. Wysocki
2016-02-19  8:09               ` Juri Lelli
2016-02-19 16:42                 ` Srinivas Pandruvada
2016-02-19 17:26                   ` Juri Lelli
2016-02-19 22:26                     ` Rafael J. Wysocki
2016-02-22  9:42                       ` Juri Lelli
2016-02-22 21:41                         ` Rafael J. Wysocki
2016-02-23 11:10                           ` Juri Lelli
2016-02-24  1:52                             ` Rafael J. Wysocki
2016-02-22 10:45                       ` Viresh Kumar
2016-02-19 17:28                   ` Steve Muckle
2016-02-19 22:35                     ` Rafael J. Wysocki
2016-02-23  3:58                       ` Steve Muckle
2016-02-22 10:52                     ` Peter Zijlstra
2016-02-22 14:33                       ` Vincent Guittot
2016-02-22 15:31                         ` Peter Zijlstra
2016-02-22 14:40                       ` Juri Lelli
2016-02-22 15:42                         ` Peter Zijlstra
2016-02-22 21:46                       ` Rafael J. Wysocki
2016-02-19 22:14                 ` Rafael J. Wysocki
2016-02-22  9:32                   ` Juri Lelli
2016-02-22 21:26                     ` Rafael J. Wysocki
2016-02-23 11:01                       ` Juri Lelli
2016-02-24  2:01                         ` Rafael J. Wysocki
2016-03-08 19:24                           ` Michael Turquette
2016-03-08 20:40                             ` Rafael J. Wysocki
     [not found]                               ` <20160308220632.4103.13377@quark.deferred.io>
2016-03-08 22:43                                 ` Rafael J. Wysocki
2016-03-09 12:35             ` Peter Zijlstra
2016-03-09 13:22               ` Rafael J. Wysocki
2016-03-09 13:32               ` Ingo Molnar
2016-03-09 13:39                 ` Rafael J. Wysocki
2016-03-10  2:12               ` Vincent Guittot
2016-02-10 15:25   ` [PATCH v6 2/3] cpufreq: intel_pstate: Replace timers with " Rafael J. Wysocki
2016-02-10 15:36   ` [PATCH v6 3/3] cpufreq: governor: " Rafael J. Wysocki
2016-02-10 23:11   ` [PATCH v6 0/3] cpufreq: " Doug Smythies
2016-02-10 23:17     ` Rafael J. Wysocki
2016-02-11 22:50       ` Doug Smythies
2016-02-11 23:28         ` Rafael J. Wysocki
2016-02-12  1:02           ` Doug Smythies
2016-02-12  1:20             ` Rafael J. Wysocki
2016-02-12  7:25         ` Doug Smythies
2016-02-12 13:39           ` Rafael J. Wysocki
2016-02-12 17:33             ` Doug Smythies
2016-02-12 23:21               ` Rafael J. Wysocki
2016-02-11  6:02     ` Srinivas Pandruvada
